Preface


This volume consists of the invited talks, papers and posters presented during
VECPAR'98, the 3rd International Meeting on Vector and Parallel Processing.
The meeting, organised by FEUP - Faculdade de Engenharia da Universidade do
Porto (Faculty of Engineering of the University of Porto), is held at
Fundação Dr. António Cupertino de Miranda, in Porto (Portugal), from 21 to 23
June, 1998.


VECPAR'98 is the third in a series of meetings on vector and parallel
computing initiated in 1993 (VECPAR'93, VECPAR'96). The format of previous
meetings was preserved: the meeting was organised around scientific sessions
opened by thematic invited key lectures, followed by contributed papers. The
66 papers and 20 posters presented at the conference were selected from more
than 120 extended abstracts originating from 27 countries.


It is our great pleasure to express our gratitude to everyone who helped us
during the preparation of this event, and in particular to the members of the
Scientific Committee. Without their collaboration and prompt reviews it would
have been impossible to meet the deadlines imposed by the organisation. Also,
thanks to the contributions and comments of the Scientific Committee, the
authors had the opportunity to improve the original versions of their papers.

We are very grateful to all sponsors for their support, without which
VECPAR'98 would not have been possible.
Abstract. We survey some unusual eigenvalue problems arising in different
applications. We show that all these problems can be cast as problems of
estimating quadratic forms. Numerical algorithms based on the well-known
Gauss-type quadrature rules and the Lanczos process are reviewed for computing
these quadratic forms. These algorithms reference the matrix in question only
through a matrix-vector product operation. Hence they are well suited for
large sparse problems. Some selected numerical examples are presented to
illustrate the efficiency of such an approach.


1  Introduction 

Matrix eigenvalue problems play a significant role in many areas of
computational science and engineering. Eigenvalue problems arising in
applications often do not appear in the standard form that we learn from a
textbook and find in software packages for solving eigenvalue problems. In
this paper, we describe some unusual eigenvalue problems we have encountered.
Some of those problems have been studied in the literature and some are new.
We are particularly interested in solving those associated with large sparse
problems. Many existing techniques are only suitable for dense matrix
computations and become inadequate for large sparse problems.

We will show that all these unusual eigenvalue problems can be converted to
the problem of computing a quadratic form $u^T f(A) u$, for a properly defined
matrix $A$, a vector $u$ and a function $f$. The numerical techniques for
computing the quadratic form discussed in this paper are based on the work
initially proposed in [6] and further developed in [11,12,2]. In this
technique, we first transform the problem of computing the quadratic form into
a Riemann-Stieltjes integral problem, and then use Gauss-type quadrature rules
to approximate the integral, which brings the theory of orthogonal polynomials
and the underlying Lanczos procedure into the scene. This approach is well
suited for large sparse problems, since it references the matrix $A$ only
through a user-provided subroutine that forms the matrix-vector product $Ax$.

The basic time-consuming kernels for computing quadratic forms in parallel
are vector inner products, vector updates and matrix-vector products; this is
similar to most iterative methods in linear algebra. Vector inner products and
updates can be easily parallelized: each processor computes the vector-vector
operations on its corresponding segments of the vectors (local vector
operations (LVOs)), and, if necessary, the results of the LVOs are sent to
other processors to be combined into the global vector-vector operations. For
the matrix-vector product, the user can either exploit the particular
structure of the matrix in question for parallelism, or split the matrix into
strips corresponding to the vector segments. Each process then computes the
matrix-vector product of one strip. Furthermore, the iterative loop of the
algorithms can be designed to overlap communication and computation and to
eliminate some of the synchronization points. The reader may see [8,4] and
references therein for further details.
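
The strip decomposition described above can be sketched serially as follows. This is only an illustration under stated assumptions: the function names `strip_matvec` and `strip_dot` are hypothetical, the "processes" are simulated by a loop over row strips, and no message passing is performed.

```python
import numpy as np

def strip_matvec(A, x, num_procs):
    """Row-strip matrix-vector product; each strip stands in for one process."""
    strips = np.array_split(np.arange(A.shape[0]), num_procs)
    # Each simulated process multiplies its strip of rows by the full vector x.
    local_results = [A[rows, :] @ x for rows in strips]
    return np.concatenate(local_results)

def strip_dot(x, y, num_procs):
    """Inner product as a sum of local partial sums (the LVO results)."""
    strips = np.array_split(np.arange(x.shape[0]), num_procs)
    partial = [x[rows] @ y[rows] for rows in strips]   # local vector operations
    return sum(partial)                                # global combination step

A = np.arange(36, dtype=float).reshape(6, 6)
x = np.arange(6, dtype=float)
y_strips = strip_matvec(A, x, num_procs=4)
d_strips = strip_dot(x, x, num_procs=4)
```

In a distributed-memory setting the final `sum` would correspond to a global reduction, which is one of the synchronization points mentioned above.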

The rest of the paper is organized as follows. Section 2 describes some
unusual eigenvalue problems and shows that these problems can be converted to
the problem of computing a quadratic form. Section 3 reviews numerical methods
for computing a quadratic form. Section 4 shows how these numerical methods
can be applied to the problems described in section 2. Some selected numerical
examples are presented in section 5. Concluding remarks are in section 6.


2  Some  Unusual  Matrix  Eigenvalue  Problems 


2.1  Constrained  eigenvalue  problem 

Let $A$ be a real symmetric matrix of order $N$, and $c$ a given $N$-vector
with $c^T c = 1$. We are interested in the following optimization problem

$$\max_{x} \; x^T A x \tag{1}$$

subject to the constraints

$$x^T x = 1 \tag{2}$$

and

$$c^T x = 0. \tag{3}$$


Let

$$\varphi(x, \lambda, \mu) = x^T A x - \lambda (x^T x - 1) + 2 \mu x^T c, \tag{4}$$

where $\lambda$, $\mu$ are Lagrange multipliers. Differentiating (4) with
respect to $x$, we are led to the equation

$$A x - \lambda x + \mu c = 0.$$

Then

$$x = -\mu (A - \lambda I)^{-1} c.$$

Using the constraint (3), we have

$$c^T (A - \lambda I)^{-1} c = 0. \tag{5}$$


An equation of this type is referred to as a secular equation. The problem
now becomes finding the largest root $\lambda$ of the above secular equation.

We note that in [10], the problem is cast as computing the largest eigenvalue
of the matrix $PAP$, where $P$ is the projection matrix $P = I - cc^T$.
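
As a small dense illustration of the two formulations (not the large sparse method of this paper), the sketch below computes the largest eigenvalue of $PAP$ with NumPy and checks that it is a root of the secular equation $c^T (A - \lambda I)^{-1} c = 0$. The function names and the 3 x 3 test matrix are assumptions made for this example; $A$ is taken positive definite so that the zero eigenvalue of $PAP$ belonging to the direction $c$ cannot be the largest one.

```python
import numpy as np

def constrained_max_eig(A, c):
    """Largest eigenvalue of P A P with P = I - c c^T (c is normalized first)."""
    c = c / np.linalg.norm(c)
    P = np.eye(len(c)) - np.outer(c, c)
    return np.linalg.eigvalsh(P @ A @ P)[-1]

def secular(A, c, lam):
    """Evaluate c^T (A - lam I)^{-1} c for a normalized c."""
    c = c / np.linalg.norm(c)
    return c @ np.linalg.solve(A - lam * np.eye(len(c)), c)

# Hypothetical test data: A = diag(1, 2, 3), c proportional to (1, 1, 1).
A = np.diag([1.0, 2.0, 3.0])
c = np.ones(3)
lam = constrained_max_eig(A, c)
residual = secular(A, c, lam)
```

For this test matrix the secular equation reduces to $3\lambda^2 - 12\lambda + 11 = 0$, whose larger root is $2 + 1/\sqrt{3}$.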
2.2 Modified eigenvalue problem

Let us consider solving the following eigenvalue problems

$$A x = \lambda x$$

and

$$(A + cc^T) x = \lambda x$$

where $A$ is a symmetric matrix and $c$ is a vector; without loss of
generality, we assume $c^T c = 1$. The second eigenvalue problem can be
regarded as a modified or perturbed version of the first one. We are
interested in obtaining some, not all, of the eigenvalues of both problems.
Such a computational task often appears in structural dynamic (re-)analysis
and other applications [5].

By simple algebraic derivation, it is easy to show that the eigenvalues
$\lambda$ of the second problem satisfy the following secular equation

$$1 + c^T (A - \lambda I)^{-1} c = 0. \tag{6}$$
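
The secular equation for the modified problem can be checked numerically on a small dense example. The sketch below is an illustration only: the helper name `modified_secular` and the test data are assumptions, and the eigenvalues are obtained with a dense solver rather than with the techniques of this paper.

```python
import numpy as np

def modified_secular(A, c, lam):
    """Evaluate 1 + c^T (A - lam I)^{-1} c."""
    n = A.shape[0]
    return 1.0 + c @ np.linalg.solve(A - lam * np.eye(n), c)

# Hypothetical test data with c^T c = 1.
A = np.diag([1.0, 2.0, 3.0])
c = np.ones(3) / np.sqrt(3.0)

# Eigenvalues of the rank-one modified matrix A + c c^T.
modified_eigs = np.linalg.eigvalsh(A + np.outer(c, c))
residuals = np.array([modified_secular(A, c, lam) for lam in modified_eigs])
```

Each entry of `residuals` should vanish up to rounding, and the modified eigenvalues interlace with those of $A$.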


2.3 Constrained quadratic optimization

Let $A$ be a symmetric positive definite matrix of order $N$ and $c$ a given
$N$-vector. The quadratic optimization problem is stated as follows:

$$\min_{x} \; x^T A x - 2 c^T x \tag{7}$$

with the constraint

$$x^T x = \alpha^2, \tag{8}$$

where $\alpha$ is a given scalar. Now let

$$\varphi(x, \lambda) = x^T A x - 2 c^T x + \lambda (x^T x - \alpha^2) \tag{9}$$

where $\lambda$ is the Lagrange multiplier. Differentiating (9) with respect
to $x$, we are led to the equation

$$(A + \lambda I) x - c = 0.$$

By the constraint (8), we are led to the problem of determining
$\lambda > 0$ such that

$$c^T (A + \lambda I)^{-2} c = \alpha^2. \tag{10}$$

Furthermore, one can show the existence of a unique positive $\lambda^*$ for
which the above equation is satisfied. The solution of the original problem
(7) and (8) is then $x^* = (A + \lambda^* I)^{-1} c$.
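
For a small dense matrix, the equation $c^T (A + \lambda I)^{-2} c = \alpha^2$ can be solved directly, since its left-hand side decreases monotonically for $\lambda > 0$. The sketch below uses an eigendecomposition and bisection; it is an assumption-laden illustration (the function name `solve_sphere_qp`, the tolerance, and the test data are hypothetical), not the quadrature-based method developed later in the paper. It assumes $\|A^{-1} c\|_2 > \alpha$, so that a positive root exists.

```python
import numpy as np

def solve_sphere_qp(A, c, alpha, tol=1e-12):
    """Find lam > 0 with c^T (A + lam I)^{-2} c = alpha^2 and x = (A + lam I)^{-1} c."""
    d, Q = np.linalg.eigh(A)            # A = Q diag(d) Q^T
    ct = Q.T @ c
    g = lambda lam: np.sum((ct / (d + lam)) ** 2) - alpha ** 2
    if g(0.0) <= 0.0:
        raise ValueError("no positive root: ||A^{-1} c|| <= alpha")
    lo, hi = 0.0, 1.0
    while g(hi) > 0.0:                  # bracket the root; g is decreasing
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi): # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    x = Q @ (ct / (d + lam))
    return lam, x

# Hypothetical test data.
A = np.diag([1.0, 2.0])
c = np.array([1.0, 1.0])
alpha = 0.5
lam_star, x_star = solve_sphere_qp(A, c, alpha)
```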
2.4 Trace and determinant

The trace and determinant problems are simply to estimate the quantities

$$\mathrm{tr}(A^{-1}) = \sum_{i=1}^{n} e_i^T A^{-1} e_i$$

and

$$\det(A)$$

for a given matrix $A$. For the determinant problem, it can be easily verified
that for a symmetric positive definite matrix $A$:

$$\ln(\det(A)) = \mathrm{tr}(\ln(A)) = \sum_{i=1}^{n} e_i^T (\ln(A)) e_i. \tag{11}$$

Therefore, the problem of estimating the determinant is essentially to
estimate the trace of the matrix natural logarithm function $\ln(A)$.
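
The identity $\ln(\det(A)) = \mathrm{tr}(\ln(A))$ is easy to confirm on a small symmetric positive definite matrix. In the sketch below the test matrix is an arbitrary assumption, and $\ln(A)$ is formed densely from the eigendecomposition purely for illustration.

```python
import numpy as np

# Hypothetical 3 x 3 symmetric positive definite test matrix (det = 18).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

w, Q = np.linalg.eigh(A)                 # A = Q diag(w) Q^T with w > 0
log_A = Q @ np.diag(np.log(w)) @ Q.T     # matrix logarithm of A
tr_log = np.trace(log_A)

sign, logdet = np.linalg.slogdet(A)      # reference value of ln|det(A)|
```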

2.5  Partial  eigenvalue  sum 

The partial eigenvalue sum problem is to compute the sum of all eigenvalues
less than a prescribed value $\alpha$ of the generalized eigenvalue problem

$$A x = \lambda B x, \tag{12}$$

where $A$ and $B$ are real $N \times N$ symmetric matrices with $B$ positive
definite. Specifically, let $\{\lambda_i\}$ be the eigenvalues; one wants to
compute the quantity

$$\tau_\alpha = \sum_{\lambda_i < \alpha} \lambda_i$$

for a given scalar $\alpha$.

Let $B = LL^T$ be the Cholesky decomposition of $B$. The problem (12) is then
equivalent to

$$(L^{-1} A L^{-T}) L^T x = \lambda L^T x.$$

Therefore the partial eigenvalue sum of the matrix pair $(A, B)$ is equal to
the partial eigenvalue sum of the matrix $L^{-1} A L^{-T}$, which, in
practice, does not need to be formed explicitly.

A number of approaches to this problem can be found in the literature. Our
approach is based on constructing a function $f$ such that the trace of
$f(L^{-1} A L^{-T})$ approximates the desired sum $\tau_\alpha$. Specifically,
one wants to construct a function $f$ such that

$$f(\lambda_i) = \begin{cases} \lambda_i, & \text{if } \lambda_i < \alpha \\ 0, & \text{if } \lambda_i > \alpha, \end{cases} \tag{13}$$

for $i = 1, 2, \ldots, N$. Then $\mathrm{tr}(f(L^{-1} A L^{-T}))$ is the
desired sum $\tau_\alpha$. One choice is to take $f$ of the form

$$f(\zeta) = \zeta \, g(\zeta) \tag{14}$$

where

$$g(\zeta) = \frac{1}{1 + \exp\left(\frac{\zeta - \alpha}{\kappa}\right)},$$

and $\kappa$ is a constant. This function, among other names, is known as the
Fermi-Dirac distribution function [15, p. 347]. In the context of a physical
system, the use of this distribution function is motivated by thermodynamics.
It directly represents the thermal occupancy of electronic states: $\kappa$ is
proportional to the temperature of the system, and $\alpha$ is the chemical
potential (the highest energy for occupied states).

It is easily seen that $0 < g(\zeta) < 1$ for all $\zeta$, with horizontal
asymptotes 0 and 1. The point $(\alpha, \frac{1}{2})$ is the inflection point
of $g$, and the sign of $\kappa$ determines whether $g$ is decreasing
($\kappa > 0$) or increasing ($\kappa < 0$). For our application, we want the
sum of all eigenvalues less than $\alpha$, so we use $\kappa > 0$. The
magnitude of $\kappa$ determines how closely the function $g$ maps
$\zeta < \alpha$ to 1 and $\zeta > \alpha$ to 0. As $\kappa \to 0^+$, the
function $g(\zeta)$ rapidly converges to the step function

$$\begin{cases} 1 & \text{if } \zeta < \alpha \\ 0 & \text{if } \zeta > \alpha. \end{cases}$$

The graphs of the function $g(\zeta)$ for $\alpha = 0$ and different values
of the parameter $\kappa$ are plotted in Figure 1.


Fig. 1. Graphs of $g(\zeta)$ for different values of $\kappa$, where $\alpha = 0$.


With this choice of $f(\zeta)$, we have

$$\tau_\alpha = \sum_{\lambda_i < \alpha} \lambda_i \approx \mathrm{tr}(f(L^{-1} A L^{-T})) = \sum_{i=1}^{N} e_i^T f(L^{-1} A L^{-T}) e_i. \tag{15}$$

In summary, the problem of computing the partial eigenvalue sum becomes that
of computing the trace of $f(L^{-1} A L^{-T})$.
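
The effect of the Fermi-Dirac smoothing can be seen on a tiny dense example. The following sketch is illustrative only: the function names, the test pair $(A, B)$ and the value of $\kappa$ are assumptions, and $L^{-1} A L^{-T}$ is formed explicitly here even though the text notes that this is not needed in practice.

```python
import numpy as np

def fermi_dirac(z, alpha, kappa):
    """g(z) = 1 / (1 + exp((z - alpha) / kappa)), written with tanh to avoid overflow."""
    return 0.5 * (1.0 - np.tanh(0.5 * (z - alpha) / kappa))

def smoothed_partial_sum(A, B, alpha, kappa):
    """Return (tr f(H), exact partial sum) for H = L^{-1} A L^{-T}, B = L L^T."""
    L = np.linalg.cholesky(B)
    # L^{-1} (L^{-1} A)^T = L^{-1} A L^{-T} because A is symmetric.
    H = np.linalg.solve(L, np.linalg.solve(L, A).T)
    lam = np.linalg.eigvalsh(H)
    approx = np.sum(lam * fermi_dirac(lam, alpha, kappa))
    exact = np.sum(lam[lam < alpha])
    return approx, exact

# Hypothetical pair with generalized eigenvalues -3, -0.5, 2 and 2.5.
A = np.diag([-3.0, -1.0, 2.0, 5.0])
B = np.diag([1.0, 2.0, 1.0, 2.0])
approx, exact = smoothed_partial_sum(A, B, alpha=0.0, kappa=0.05)
```

With a small positive $\kappa$ the smoothed trace is close to the exact partial sum; eigenvalues close to $\alpha$ relative to $\kappa$ would contribute a larger smoothing error.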

3  Quadratic  Form  Computing 

As we have seen, all the unusual eigenvalue problems presented in section 2
can be summarized as the problem of computing the quadratic form
$u^T f(A) u$, where $A$ is an $N \times N$ real symmetric matrix, $u$ is a
vector, and $f$ is a properly defined function. One needs to find an
approximation of the quantity $u^T f(A) u$, or give a lower bound $\ell$
and/or an upper bound $\nu$ of it. Without loss of generality, one may assume
$u^T u = 1$.

The quadratic form computing problem was first proposed in [6] for bounding
the error of the CG method for solving linear systems of equations. It has
been further developed in [11,12,2] and extended to other applications. The
main idea is to first transform the problem of computing the quadratic form
into a Riemann-Stieltjes integral problem, and then use Gauss-type quadrature
rules to approximate the integral, which brings the theory of orthogonal
polynomials and the underlying Lanczos procedure into the picture.

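Let us go through the main idea. Since $A$ is symmetric, the
eigen-decomposition of $A$ is given by $A = Q^T \Lambda Q$, where $Q$ is an
orthogonal matrix and $\Lambda$ is a diagonal matrix with increasingly ordered
diagonal elements $\lambda_i$. Then we have

$$u^T f(A) u = u^T Q^T f(\Lambda) Q u = \tilde{u}^T f(\Lambda) \tilde{u} = \sum_{i=1}^{N} f(\lambda_i) \tilde{u}_i^2,$$

where $\tilde{u} = (\tilde{u}_i) = Qu$. The last sum can be considered as a
Riemann-Stieltjes integral

$$u^T f(A) u = \int_a^b f(\lambda) \, d\mu(\lambda),$$

where the measure $\mu(\lambda)$ is a piecewise constant function defined by

$$\mu(\lambda) = \begin{cases} 0, & \text{if } \lambda < \lambda_1, \\ \sum_{j=1}^{i} \tilde{u}_j^2, & \text{if } \lambda_i \le \lambda < \lambda_{i+1}, \\ \sum_{j=1}^{N} \tilde{u}_j^2 = 1, & \text{if } \lambda_N \le \lambda, \end{cases}$$

and $a \le \lambda_1$ and $b \ge \lambda_N$ are the lower and upper bounds of
the eigenvalues $\lambda_i$.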

To obtain an estimate for the Riemann-Stieltjes integral, one can use the
Gauss-type quadrature rule [9, 7]. The general quadrature formula is of the
form

$$I[f] = \sum_{j=1}^{n} \omega_j f(\theta_j) + \sum_{k=1}^{m} \rho_k f(\tau_k), \tag{16}$$

where the weights $\{\omega_j\}$ and $\{\rho_k\}$ and the nodes
$\{\theta_j\}$ are unknown and to be determined. The nodes $\{\tau_k\}$ are
prescribed. If $m = 0$, then it is the well-known Gauss rule. If $m = 1$ and
$\tau_1 = a$ or $\tau_1 = b$, it is the Gauss-Radau rule. The Gauss-Lobatto
rule is for $m = 2$ and $\tau_1 = a$ and $\tau_2 = b$.

The accuracy of the Gauss-type quadrature rules may be obtained by an
estimation of the remainder $R[f]$:

$$R[f] = \int_a^b f(\lambda) \, d\mu(\lambda) - I[f].$$

For example, for the Gauss quadrature rule,

$$R[f] = \frac{f^{(2n)}(\eta)}{(2n)!} \int_a^b \Big[ \prod_{i=1}^{n} (\lambda - \theta_i) \Big]^2 d\mu(\lambda),$$

where $a < \eta < b$. Similar formulas exist for the Gauss-Radau and
Gauss-Lobatto rules. If the sign of $R[f]$ is determined, then the quadrature
formula $I[f]$ is a lower bound (if $R[f] > 0$) or an upper bound (if
$R[f] < 0$) of the quantity $u^T f(A) u$.

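Let us briefly recall how the weights and the nodes in the quadrature formula
are obtained. First, we know that a sequence of polynomials
$p_0(\lambda), p_1(\lambda), p_2(\lambda), \ldots$ can be defined such that
they are orthonormal with respect to the measure $\mu(\lambda)$:

$$\int_a^b p_i(\lambda) p_j(\lambda) \, d\mu(\lambda) = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j, \end{cases}$$

where it is assumed that the normalization condition $\int d\mu = 1$ holds
(i.e., $u^T u = 1$). The sequence of orthonormal polynomials $p_j(\lambda)$
satisfies a three-term recurrence

$$\gamma_j p_j(\lambda) = (\lambda - \alpha_j) p_{j-1}(\lambda) - \gamma_{j-1} p_{j-2}(\lambda),$$

for $j = 1, 2, \ldots, n$ with $p_{-1}(\lambda) \equiv 0$ and
$p_0(\lambda) \equiv 1$. Writing the recurrence in matrix form, we have

$$\lambda p(\lambda) = T_n p(\lambda) + \gamma_n p_n(\lambda) e_n,$$

where

$$p(\lambda)^T = [p_0(\lambda), p_1(\lambda), \ldots, p_{n-1}(\lambda)], \qquad e_n^T = [0, 0, \ldots, 1]$$

and

$$T_n = \begin{pmatrix} \alpha_1 & \gamma_1 & & & \\ \gamma_1 & \alpha_2 & \gamma_2 & & \\ & \gamma_2 & \alpha_3 & \ddots & \\ & & \ddots & \ddots & \gamma_{n-1} \\ & & & \gamma_{n-1} & \alpha_n \end{pmatrix}.$$

Then for the Gauss quadrature rule, the eigenvalues of $T_n$ (which are the
zeros of $p_n(\lambda)$) are the nodes $\theta_j$. The weights $\omega_j$ are
the squares of the first elements of the normalized (i.e., unit norm)
eigenvectors of $T_n$.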
For the Gauss-Radau and Gauss-Lobatto rules, the nodes $\{\theta_j\}$,
$\{\tau_k\}$ and weights $\{\omega_j\}$, $\{\rho_k\}$ come from the
eigenvalues and the squares of the first elements of the normalized
eigenvectors of an adjusted tridiagonal matrix of $T_{n+1}$, which has the
prescribed eigenvalues $a$ and/or $b$.

To this end, we recall that the classical Lanczos procedure is an elegant way
to compute the orthonormal polynomials $\{p_j(\lambda)\}$ [16, 11]. We have
the following algorithm in summary form. We refer to it as the Gauss-Lanczos
(GL) algorithm.


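GL algorithm: Let $A$ be an $N \times N$ real symmetric matrix and $u$ a real
$N$-vector with $u^T u = 1$. $f$ is a given smooth function. Then the
following algorithm computes an estimation $I_n$ of the quantity
$u^T f(A) u$ by using the Gauss rule with $n$ nodes.

- Let $x_0 = u$, $x_{-1} = 0$ and $\gamma_0 = 0$

- For $j = 1, 2, \ldots, n$,

    1. $\alpha_j = x_{j-1}^T A x_{j-1}$
    2. $v_j = A x_{j-1} - \alpha_j x_{j-1} - \gamma_{j-1} x_{j-2}$
    3. $\gamma_j = \|v_j\|_2$
    4. $x_j = v_j / \gamma_j$

- Compute the eigenvalues $\theta_k$ and the first elements $\omega_k$ of the
  eigenvectors of $T_n$

- Compute $I_n = \sum_{k=1}^{n} \omega_k^2 f(\theta_k)$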

We note that the body of the “For” loop in the above algorithm is one
iteration step of the standard symmetric Lanczos procedure [16]. The matrix
$A$ in question is only referenced here in the form of the matrix-vector
product. The Lanczos procedure can be implemented with only 3 $N$-vectors in
fast memory. This is the major storage requirement of the algorithm and is an
attractive feature for large scale problems.
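
A minimal NumPy sketch of the Gauss-Lanczos idea follows. It is written under stated assumptions rather than as the authors' implementation: the function name `gauss_lanczos`, the breakdown threshold and the diagonal test matrix are hypothetical, no reorthogonalization is performed, and the matrix is accessed only through a user-supplied `matvec` callable.

```python
import numpy as np

def gauss_lanczos(matvec, u, f, n):
    """Gauss-rule estimate of u^T f(A) u (u normalized) from at most n Lanczos steps."""
    x = u / np.linalg.norm(u)
    x_prev = np.zeros_like(x)
    gamma_prev = 0.0
    alphas, gammas = [], []
    for j in range(n):
        w = matvec(x)                        # the only access to A
        alpha = x @ w
        v = w - alpha * x - gamma_prev * x_prev
        gamma = np.linalg.norm(v)
        alphas.append(alpha)
        if j == n - 1 or gamma < 1e-13:      # last step, or invariant subspace found
            break
        gammas.append(gamma)
        x_prev, x, gamma_prev = x, v / gamma, gamma
    # Tridiagonal Jacobi matrix T_n built from the recurrence coefficients.
    T = np.diag(alphas) + np.diag(gammas, 1) + np.diag(gammas, -1)
    theta, S = np.linalg.eigh(T)             # nodes and normalized eigenvectors
    return float(np.sum(S[0, :] ** 2 * f(theta)))

# Hypothetical test: A = diag(1, ..., 6), f(t) = 1/t, u proportional to ones.
d = np.arange(1.0, 7.0)
matvec = lambda x: d * x
u = np.ones(6)
exact = np.sum((u / np.linalg.norm(u)) ** 2 / d)
estimates = [gauss_lanczos(matvec, u, lambda t: 1.0 / t, n) for n in (1, 2, 4, 6)]
```

For $f(\lambda) = 1/\lambda$ on a positive definite matrix the even derivatives of $f$ are positive, so the Gauss estimates approach the exact value from below as $n$ grows.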

On the return of the algorithm, from the expression of $R[f]$, we may
estimate the error of the approximation $I_n$. For example, if
$f^{(2n)}(\eta) > 0$ for any $n$ and $\eta$, $a < \eta < b$, then $I_n$ is a
lower bound $\ell$ of the quantity $u^T f(A) u$.


Gauss-Radau-Lanczos (GRL) algorithm: To implement the Gauss-Radau rule with
the prescribed node $\tau_1 = a$ or $\tau_1 = b$, the above GL algorithm just
needs to be slightly modified. For example, with $\tau_1 = a$, we need to
extend the matrix $T_n$ to

$$\tilde{T}_{n+1} = \begin{pmatrix} T_n & \gamma_n e_n \\ \gamma_n e_n^T & \phi \end{pmatrix}.$$

Here the parameter $\phi$ is chosen such that $\tau_1 = a$ is an eigenvalue of
$\tilde{T}_{n+1}$. From [10], it is known that

$$\phi = a + \delta_n,$$

where $\delta_n$ is the last component of the solution $\delta$ of the
tridiagonal system

$$(T_n - aI) \delta = \gamma_n^2 e_n.$$

Then the eigenvalues and the first components of the eigenvectors of
$\tilde{T}_{n+1}$ give the nodes and weights of the Gauss-Radau rule used to
compute an estimation $\tilde{I}_n$ of $u^T f(A) u$.

Furthermore, if $f^{(2n+1)}(\eta) < 0$ for any $n$ and $\eta$,
$a < \eta < b$, then $\tilde{I}_n$ (with $b$ as a prescribed eigenvalue of
$\tilde{T}_{n+1}$) is a lower bound $\ell$ of the quantity $u^T f(A) u$, and
$\tilde{I}_n$ (with $a$ as a prescribed eigenvalue of $\tilde{T}_{n+1}$) is an
upper bound $\nu$.
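
The Gauss-Radau modification can be sketched as follows. This is an illustration under assumptions, not the authors' code: the helper names `lanczos`, `radau_extend` and `quad_form`, the diagonal test matrix and the interval $[a, b] = [0.5, 6.5]$ are hypothetical, and the last diagonal entry is computed as $\phi = \tau_1 + \delta_n$ with $(T_n - \tau_1 I)\delta = \gamma_n^2 e_n$, which makes $\tau_1$ an eigenvalue of the extended matrix.

```python
import numpy as np

def lanczos(A, u, n):
    """n Lanczos steps without reorthogonalization; returns T_n and gamma_n."""
    x_prev, x, g_prev = np.zeros_like(u), u / np.linalg.norm(u), 0.0
    a, g = [], []
    for _ in range(n):
        w = A @ x
        a.append(x @ w)
        v = w - a[-1] * x - g_prev * x_prev
        g.append(np.linalg.norm(v))          # assumes no breakdown (g > 0)
        x_prev, x, g_prev = x, v / g[-1], g[-1]
    T = np.diag(a) + np.diag(g[:-1], 1) + np.diag(g[:-1], -1)
    return T, g[-1]

def radau_extend(T, gamma_n, tau):
    """Extend T_n by one row and column so that tau becomes an eigenvalue."""
    n = T.shape[0]
    e_n = np.zeros(n)
    e_n[-1] = 1.0
    delta = np.linalg.solve(T - tau * np.eye(n), gamma_n ** 2 * e_n)
    T_ext = np.zeros((n + 1, n + 1))
    T_ext[:n, :n] = T
    T_ext[n, n - 1] = T_ext[n - 1, n] = gamma_n
    T_ext[n, n] = tau + delta[-1]
    return T_ext

def quad_form(T, f):
    """e_1^T f(T) e_1 from the eigendecomposition of T."""
    theta, S = np.linalg.eigh(T)
    return float(np.sum(S[0, :] ** 2 * f(theta)))

# Hypothetical test: A = diag(1, ..., 6), spectrum enclosed in [a, b] = [0.5, 6.5].
A = np.diag(np.arange(1.0, 7.0))
u = np.ones(6)
a, b = 0.5, 6.5
inv = lambda t: 1.0 / t
exact = np.sum((u / np.linalg.norm(u)) ** 2 / np.diag(A))

T, gamma_n = lanczos(A, u, 3)
T_a = radau_extend(T, gamma_n, a)
lower = quad_form(radau_extend(T, gamma_n, b), inv)   # prescribed node b
upper = quad_form(T_a, inv)                           # prescribed node a
```

For $f(\lambda) = 1/\lambda$ the odd derivatives are negative on a positive interval, so the two Radau estimates bracket the exact quadratic form.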


Gauss-Lobatto-Lanczos (GLL) algorithm: To implement the Gauss-Lobatto rule,
$T_n$ computed in the GL algorithm is updated to

$$\hat{T}_{n+1} = \begin{pmatrix} T_n & \psi e_n \\ \psi e_n^T & \phi \end{pmatrix}.$$

Here the parameters $\phi$ and $\psi$ are chosen so that $a$ and $b$ are
eigenvalues of $\hat{T}_{n+1}$. Again, from [10], it is known that

$$\phi = \frac{\delta_n b - \mu_n a}{\delta_n - \mu_n} \quad \text{and} \quad \psi^2 = \frac{b - a}{\delta_n - \mu_n},$$

where $\delta_n$ and $\mu_n$ are the last components of the solutions
$\delta$ and $\mu$ of the tridiagonal systems

$$(T_n - aI) \delta = e_n \quad \text{and} \quad (T_n - bI) \mu = e_n.$$

The eigenvalues and the first components of the eigenvectors of
$\hat{T}_{n+1}$ give the nodes and weights of the Gauss-Lobatto rule used to
compute an estimation $\hat{I}_n$ of $u^T f(A) u$. Moreover, if
$f^{(2n)}(\eta) > 0$ for any $n$ and $\eta$, $a < \eta < b$, then
$\hat{I}_n$ is an upper bound $\nu$ of the quantity $u^T f(A) u$.

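Finally, we note that we need not always compute the eigenvalues and the
first components of the eigenvectors of the tridiagonal matrix $T_n$ or its
modifications $\tilde{T}_{n+1}$ or $\hat{T}_{n+1}$ to obtain the estimations
$I_n$, $\tilde{I}_n$ or $\hat{I}_n$. We have the following proposition.

Proposition 1. For the Gauss rule:

$$I_n = \sum_{k=1}^{n} \omega_k^2 f(\theta_k) = e_1^T f(T_n) e_1.$$

For the Gauss-Radau rule:

$$\tilde{I}_n = \sum_{k=1}^{n} \omega_k^2 f(\theta_k) + \rho_1^2 f(\tau_1) = e_1^T f(\tilde{T}_{n+1}) e_1.$$

For the Gauss-Lobatto rule:

$$\hat{I}_n = \sum_{k=1}^{n} \omega_k^2 f(\theta_k) + \rho_1^2 f(\tau_1) + \rho_2^2 f(\tau_2) = e_1^T f(\hat{T}_{n+1}) e_1.$$

Therefore, if the (1,1) entry of $f(T_n)$, $f(\tilde{T}_{n+1})$ or
$f(\hat{T}_{n+1})$ can be easily computed, for example when
$f(\lambda) = 1/\lambda$, we do not need to compute the eigenvalues and
eigenvectors.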
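
For $f(\lambda) = 1/\lambda$ the (1,1) entry of $f(T)$ is the first component of the solution of $T y = e_1$, so one linear solve replaces the eigendecomposition. The sketch below checks this on a small tridiagonal matrix; the matrix entries are arbitrary assumptions chosen for the example.

```python
import numpy as np

# Hypothetical symmetric positive definite tridiagonal matrix (det = 18).
T = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

# Quadrature form: nodes theta_k, first eigenvector components S[0, k].
theta, S = np.linalg.eigh(T)
via_eig = np.sum(S[0, :] ** 2 / theta)

# Direct form: (1,1) entry of T^{-1} from one linear solve.
e1 = np.zeros(3)
e1[0] = 1.0
via_solve = np.linalg.solve(T, e1)[0]
```

Both quantities equal the cofactor ratio $11/18$ for this matrix.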
4 Solving the UEPs by Quadratic Form Computing

In  this  section,  we  use  the  GL,  GRL  and  GLL  algorithms  for  solving  those 
unusual  eigenvalue  problems  discussed  in  section  2. 


Constrained eigenvalue problem. Using the GL algorithm with the matrix $A$
and the vector $c$, we have

$$c^T (A - \lambda I)^{-1} c = e_1^T (T_n - \lambda I)^{-1} e_1 + R,$$

where $R$ is the remainder. Now we may solve the reduced-order secular
equation

$$e_1^T (T_n - \lambda I)^{-1} e_1 = 0$$

to find the largest $\lambda$ as the approximate solution of the problem.
This secular equation can be solved using the method discussed in [17], an
implementation of which is available in LAPACK [1].


Modified eigenvalue problem. Again, using the GL algorithm with the matrix
$A$ and the vector $c$, we have

$$1 + c^T (A - \lambda I)^{-1} c = 1 + e_1^T (T_n - \lambda I)^{-1} e_1 + R,$$

where $R$ is the remainder. Then we may solve the eigenvalue problem of $T_n$
to approximate some eigenvalues of $A$, and then solve the reduced-order
secular equation

$$1 + e_1^T (T_n - \lambda I)^{-1} e_1 = 0$$

for $\lambda$ to find some approximate eigenvalues of the modified eigenvalue
problem.


Constrained quadratic programming. By using the GRL algorithm with the
prescribed node $\tau_1 = b$ for the matrix $A$ and vector $c$, it can be
shown that

$$c^T (A + \lambda I)^{-2} c \ge e_1^T (\tilde{T}_{n+1} + \lambda I)^{-2} e_1$$

for all $\lambda > 0$. Then by solving the reduced-order secular equation

$$e_1^T (\tilde{T}_{n+1} + \lambda I)^{-2} e_1 = \alpha^2$$

for $\lambda$, we obtain $\underline{\lambda}_n$, which is a lower bound of
the solution $\lambda^*$: $\underline{\lambda}_n \le \lambda^*$.

On the other hand, using the GRL algorithm with the prescribed node
$\tau_1 = a$, we have

$$c^T (A + \lambda I)^{-2} c \le e_1^T (\tilde{T}_{n+1} + \lambda I)^{-2} e_1$$

for all $\lambda > 0$. Then by solving the reduced-order secular equation

$$e_1^T (\tilde{T}_{n+1} + \lambda I)^{-2} e_1 = \alpha^2$$

for $\lambda$, we have an upper bound $\bar{\lambda}_n$ of the solution
$\lambda^*$: $\bar{\lambda}_n \ge \lambda^*$.

Using such a two-sided approximation, as illustrated in Figure 2, the
iteration can proceed adaptively until the estimations
$\underline{\lambda}_n$ and $\bar{\lambda}_n$ are sufficiently close. We then
obtain an approximation

$$\lambda^* \approx \frac{1}{2} (\underline{\lambda}_n + \bar{\lambda}_n)$$

of the desired solution $\lambda^*$.


Fig. 2. Two-sided approximation of the solution $\lambda^*$ for the
constrained quadratic programming problem (7) and (8).


Trace, determinant and partial eigenvalue sum. As shown in sections 2.4 and
2.5, the problems of computing the trace of the inverse of a matrix $A$, the
determinant of a matrix $A$ and the partial eigenvalue sum of a symmetric
positive definite pair $(A, B)$ can be summarized as the problem of computing
the trace of a corresponding matrix function $f(H)$, where $H = A$ or
$H = L^{-1} A L^{-T}$ and $f(\lambda) = 1/\lambda$, $\ln(\lambda)$ or
$\lambda / (1 + \exp(\frac{\lambda - \alpha}{\kappa}))$. To efficiently
compute the trace of $f(H)$, instead of applying the GR algorithm or its
variations $N$ times, once for each diagonal element of $f(H)$, we may use a
Monte Carlo approach which applies the GR algorithm only $m$ times to obtain
an unbiased estimation of $\mathrm{tr}(f(H))$. For practical purposes, $m$ can
be chosen much smaller than $N$. The saving in computational costs could be
significant. Such a Monte Carlo approach is based on the following lemma due
to Hutchinson [14].

Proposition 2. Let $C = (c_{ij})$ be an $N \times N$ symmetric matrix with
$\mathrm{tr}(C) \ne 0$. Let $V$ be the discrete random variable which takes
the values 1 and -1 each with probability 0.5 and let $z$ be a vector of $N$
independent samples from $V$. Then $z^T C z$ is an unbiased estimator of
$\mathrm{tr}(C)$, i.e.,

$$E(z^T C z) = \mathrm{tr}(C),$$

and

$$\mathrm{var}(z^T C z) = 2 \sum_{i \ne j} c_{ij}^2.$$

To use the above proposition in practice, one takes $m$ such sample vectors
$z_i$, and then uses the GR algorithm or its variations to obtain an
estimation $I_n^{(i)}$, a lower bound $\ell_n^{(i)}$ and/or an upper bound
$\nu_n^{(i)}$ of the quantity $z_i^T f(H) z_i$:

$$\ell_n^{(i)} \le z_i^T f(H) z_i \le \nu_n^{(i)}.$$

Then by taking the mean of the $m$ computed estimations $I_n^{(i)}$ or lower
and upper bounds $\ell_n^{(i)}$ and $\nu_n^{(i)}$, we have

$$\mathrm{tr}(f(H)) \approx \frac{1}{m} \sum_{i=1}^{m} I_n^{(i)}$$

or

$$\frac{1}{m} \sum_{i=1}^{m} \ell_n^{(i)} \le \frac{1}{m} \sum_{i=1}^{m} z_i^T f(H) z_i \le \frac{1}{m} \sum_{i=1}^{m} \nu_n^{(i)}.$$

It is natural to expect that with a suitable sample size $m$, the mean of the
computed bounds yields a satisfactory estimation of the quantity
$\mathrm{tr}(f(H))$. To assess the quality of such an estimation, one can also
obtain probabilistic bounds of the approximate value [2].
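
The statistical statement of Proposition 2 can be checked exactly on a small matrix by enumerating all sign vectors, and the Monte Carlo estimator itself takes only a few lines. In this sketch the function name `hutchinson_trace`, the 3 x 3 matrix and the sample size are assumptions, and the quadratic form is evaluated exactly instead of through the quadrature-based algorithms.

```python
import itertools
import numpy as np

def hutchinson_trace(quad_form, N, m, rng):
    """Mean of m samples of z^T C z with z drawn uniformly from {-1, +1}^N."""
    total = 0.0
    for _ in range(m):
        z = rng.choice([-1.0, 1.0], size=N)
        total += quad_form(z)
    return total / m

# Hypothetical symmetric test matrix with trace 6.
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, -0.3],
              [0.0, -0.3, 3.0]])

# Exact mean and variance of z^T C z over all 2^3 equally likely sign vectors.
samples = np.array([np.array(z) @ C @ np.array(z)
                    for z in itertools.product([-1.0, 1.0], repeat=3)])
mean_exact = samples.mean()
var_exact = samples.var()
off_diag = C - np.diag(np.diag(C))

estimate = hutchinson_trace(lambda z: z @ C @ z, 3, 200, np.random.default_rng(0))
```

In the setting of this section, `quad_form` would instead return a quadrature estimate or bound of $z^T f(H) z$, so that $f(H)$ is never formed.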


5  Numerical  Examples 

In this section, we present some numerical examples to illustrate our
quadratic form based algorithms for solving some of the unusual eigenvalue
problems discussed in section 2.


5.1  Trace  and  determinant 

Numerical results for a set of test matrices, presented in Tables 1 and 2,
were first reported in [2]. Some of these test matrices are model problems and
some are from practical applications. For example, the VFH matrix is from the
analysis of transverse vibration of a Vicsek fractal. These numerical
experiments were carried out on a Sun Sparc workstation. The so-called “exact”
value is computed by using the standard methods for dense matrices. The
numbers in the “Iter” column are the numbers of iterations $n$ required for
the estimation to reach a stationary value within the given tolerance
$tol = 10^{-4}$, namely,

$$|I_n - I_{n-1}| \le tol \cdot |I_n|.$$

The number of random sample vectors $z_i$ is $m = 20$. For those test
matrices, the relative accuracy of the new approach, within 0.3% to 8.2%, may
be sufficient for practical purposes.
Table 1. Numerical results of estimating $\mathrm{tr}(A^{-1})$

Matrix     N     “Exact”      Iter     Estimated    Rel.err
Poisson    900   5.126e+02    30-50    5.020e+02    2.0%
VFH        625   5.383e+02    12-21    5.366e+02    0.3%
Wathen     481   2.681e+01    33-58    2.667e+01    0.5%
Lehmer     200   2.000e+04    38-70    2.017e+04    0.8%


Table 2. Numerical results of estimating $\ln(\det(A)) = \mathrm{tr}(\ln A)$

Matrix     N     “Exact”      Iter     Estimated    Rel.err
Poisson    900   1.065e+03    11-29    1.060e+03    0.4%
VFH        625   3.677e+02    10-14    3.661e+02    0.4%
Heat Flow  900   5.643e+01    4        5.669e+01    0.4%
Pei        300   5.707e+00    2-3      5.240e+00    8.2%
5.2 Partial eigenvalue sum


Here we present a numerical example from the computation of the total energy
of an electronic structure. Total energy calculation of a solid state system
is necessary in simulating real materials of technological importance [18].
Figure 3 shows a carbon cluster that forms part of a “knee” structure
connecting nanotubes of different diameters, together with the distribution of
eigenvalues of such a carbon structure with 240 atoms. One is interested in
computing the sum of all the eigenvalues less than zero. Comparing the
performance of our method with dense methods, namely the symmetric QR
algorithm and the bisection method in LAPACK, our method achieved up to a
factor of 20 speedup for large systems on a Convex Exemplar SPP-1200 (see
Table 3). Because of large memory requirements, we were not able to use the
LAPACK divide-and-conquer symmetric eigenroutines. Furthermore, algorithms for
solving large sparse eigenvalue problems, such as the Lanczos method or
implicitly restarted methods for computing some eigenvalues, were found
inadequate due to the large number of eigenvalues required. Since the problem
has to be solved repeatedly, this speedup means that we are now able to solve
previously intractable large scale problems. The relative accuracy of the new
approach, within 0.4% to 1.5%, is satisfactory for the application [3].




Fig. 3. A carbon cluster that forms part of a “knee” structure, and the
corresponding spectrum
Table 3. Performance of our method vs. dense methods on Convex Exemplar
SPP-1200. Here, 10 Monte Carlo samples were used to obtain estimates for each
system size.

              |          Dense methods          |    GR Algorithm     | % Relative
   n      m   | Partial Sum   QR Time   BI Time |  Estimate     Time  |   Error
  480    349  |   -4849.8        7.4       7.6  |   -4850.2      2.8  |   0.01
  960    648  |   -9497.6       61.9      51.8  |   -9569.6     18.5  |   0.7
 1000    675  |   -9893.3       80.1      58.6  |  -10114.1     22.4  |   2.2
 1500    987  |  -14733.1      253.6     185.6  |  -14791.8     46.4  |   0.4
 1920   1249  |  -18798.5      548.3     387.7  |  -19070.8     72.6  |   1.4
 2000   1299  |  -19572.9      616.9     431.8  |  -19434.7     78.5  |   0.7
 2500   1660  |  -24607.6     1182.2     844.6  |  -24739.6    117.2  |   0.5
 3000   1976  |  -29471.3     1966.4    1499.7  |  -29750.9    143.5  |   0.9
 3500   2276  |  -34259.5     3205.9    2317.4  |  -33738.5    294.0  |   1.5
 4000   2571  |  -39028.9     4944.3    3553.2  |  -39318.0    306.0  |   0.7
 4244   2701  |  -41299.2     5915.4    4188.0  |  -41389.8    339.8  |   0.2


6  Concluding  Remarks 

In this paper, we have surveyed numerical techniques based on computing
quadratic forms for solving some unusual eigenvalue problems. Although there
exist some numerical methods for solving these problems (see [13] and
references therein), most of these can be applied only to small and/or dense
problems. The techniques presented here reference the matrix in question only
through a matrix-vector product operation. Hence, they are more suitable for
large sparse problems.

The new approach deserves further study, in particular on error estimation
and convergence of the methods. An extensive comparative study of the
tradeoffs in accuracy and computational costs between the new approach and
other existing methods should be conducted.

Acknowledgement Z. B. was supported in part by an NSF grant ASC-9313958 and
a DOE grant DE-FG03-94ER25219.
Abstract. In this work we present parallel versions of two preconditioners.
The first one is based on a partially decoupled block form of the ILU. We call
it Block-ILU(fill, $\tau$, overlap), because it permits the control of both the
block fill and the block overlap. The second one is based on the SPAI (SParse
Approximate Inverse) method. Both methods are analysed and compared to the ILU
preconditioner, using Bi-CGSTAB to solve general sparse, nonsymmetric systems.
Results have been obtained for different matrices. The preconditioners have been
compared in terms of robustness, speedup and execution time, to determine which
is the best one in each situation. These solvers have been implemented for
distributed memory multicomputers, making use of the MPI message passing
standard library.


Keywords: parallel preconditioners, nonsymmetric linear systems, Block-ILU,
SPAI, Bi-CGSTAB.


1  Introduction 

In the development of simulation programs in different research fields, from fluid
mechanics to semiconductor devices, the solution of the systems of equations
which arise from the discretization of partial differential equations is the most
CPU-consuming part [13]. In general, the matrices are very large, sparse,
nonsymmetric and not diagonally dominant [3,12]. So, using an effective method
to solve the system is essential.

We consider a linear system of equations of the form

$$Ax = b, \qquad A \in \mathbb{R}^{n \times n}, \quad x, b \in \mathbb{R}^{n} \tag{1}$$

where A is a sparse, nonsymmetric matrix.
Direct methods, such as Gaussian elimination, LU factorization or Cholesky
factorization, may be excessively costly in terms of computational time and
memory, especially when n is large. Due to these problems, iterative methods
[1,14] are generally preferred for the solution of large sparse systems. In this
work we have chosen a nonstationary iterative solver, the Bi-Conjugate Gradient
Stabilized method (Bi-CGSTAB) [19]. Bi-CGSTAB is one of the methods that obtain
better results in the solution of nonsymmetric linear systems, and its attractive
convergence behaviour has been confirmed in many numerical experiments in
different fields [7].

In order to reduce the number of iterations needed in the Bi-CGSTAB process,
it is convenient to precondition the matrices, that is, to transform the linear
system into an equivalent one, in the sense that it has the same solution, but
with more favourable spectral properties.

The search for efficient parallel preconditioners is a very important topic in
current research in the field of scientific computing. A broad class of
preconditioners is based on incomplete factorizations (incomplete Cholesky or
ILU) of the coefficient matrix. One important problem associated with these
preconditioners is their inherently sequential character. This implies that they
are very hard to parallelise, and only a modest amount of parallelism can be
attained, with complicated implementations. So, it is important to find
alternative forms of preconditioners that are more suitable for parallel
architectures.

The first preconditioner we present is based on a partially decoupled block
form of the ILU [2]. This new version, called Block-ILU(fill, $\tau$, overlap),
permits the control of its effectiveness through a dropping parameter $\tau$ and
a block fill-in parameter. Moreover, it permits the control of the overlap
between the blocks. We have verified that the fill-in control is very important
for getting the most out of this preconditioner. Its main advantage is that it
presents a very efficient parallel execution, because it avoids the data
dependence of sequential ILU, obtaining high performance and scalability. A
disadvantage is that it is less robust than complete ILU, due to the loss of
information, and this can be a problem in very badly conditioned systems.

The second preconditioner we present is an implementation of the SPAI (SParse
Approximate Inverse) preconditioner [5,8]. This approach has been proposed in
the last few years as an alternative to ILU, in situations where the latter
obtains very poor results (situations which often arise when the matrices are
indefinite or have large nonsymmetric parts). These methods are based on finding
a matrix M which is a direct approximation to the inverse of A, so that
$AM \approx I$.

This paper presents a parallel version of these preconditioners. Section 2
presents the iterative methods we have used. Section 3 introduces the
characteristics of the Block-ILU and the SPAI preconditioners. Section 4
describes the numerical experiments we have carried out. The conclusions are
given in Section 5.

2  Iterative  Methods 

Iterative methods are a wide range of techniques that use successive
approximations to obtain more accurate solutions to linear systems at each step.
There are two types of iterative methods. Stationary methods, like Jacobi,
Gauss-Seidel, SOR, etc., are older, simpler to understand and implement, but
usually not very effective. Nonstationary methods, like Conjugate Gradient,
Minimum Residual, QMR, Bi-CGSTAB, etc., are a relatively recent development and
can be highly effective. These methods are based on the idea of sequences of
orthogonal vectors.

In recent years the Conjugate Gradient-Squared (CGS) method [1] has been
recognized as an attractive variant of the Bi-Conjugate Gradient (Bi-CG) method
for the solution of certain classes of nonsymmetric linear systems. Recent
studies indicate that the method is often competitive with other well
established methods, such as GMRES [15]. The CGS method has tended to be used in
the solution of two- or three-dimensional problems, despite its irregular
convergence pattern, because when it works (which is most of the time) it works
quite well. Recently, van der Vorst [19] has presented a new variant of Bi-CG,
called Bi-CGSTAB, which combines the efficiency of CGS with the more regular
convergence pattern of Bi-CG.

In this work we have chosen the Bi-Conjugate Gradient Stabilized method [1,19]
because of its attractive convergence behaviour. This method was developed to
solve nonsymmetric linear systems while avoiding the irregular convergence
patterns of the Conjugate Gradient Squared method. Bi-CGSTAB requires two
matrix-vector products and four inner products per iteration.

In order to reduce the number of iterations needed in the Bi-CGSTAB process,
it is convenient to precondition the matrices. The preconditioning can be
applied in two ways: either we solve the explicitly preconditioned system using
the normal algorithm, or we introduce the preconditioning process into the
iterations of Bi-CGSTAB. The latter approach is usually preferred.
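To make the second option concrete, here is a minimal sketch of Bi-CGSTAB with the preconditioner applied inside the iteration. It follows the textbook form of the algorithm and is not the authors' MPI code. The names `bicgstab`, `apply_prec`, `matvec` and `dot` are illustrative, the matrix is stored densely for brevity, a nonzero right-hand side is assumed, and no breakdown safeguards are included. The comments mark the two matrix-vector products and four inner products per iteration; the convergence tests cost additional norms.

```python
def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bicgstab(A, b, apply_prec, tol=1e-10, max_iter=200):
    """Preconditioned Bi-CGSTAB starting from x = 0 (assumes b != 0).

    apply_prec(v) must return an approximation to A^{-1} v.
    Returns the approximate solution and the number of iterations used.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                 # residual for the zero initial guess
    r_hat = list(r)             # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)                          # inner product 1
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        p_hat = apply_prec(p)                            # preconditioning step
        v = matvec(A, p_hat)                             # matrix-vector product 1
        alpha = rho_new / dot(r_hat, v)                  # inner product 2
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if dot(s, s) ** 0.5 <= tol * b_norm:             # early exit (extra norm)
            return [xi + alpha * ph for xi, ph in zip(x, p_hat)], it
        s_hat = apply_prec(s)                            # preconditioning step
        t = matvec(A, s_hat)                             # matrix-vector product 2
        omega = dot(t, s) / dot(t, t)                    # inner products 3 and 4
        x = [xi + alpha * ph + omega * sh
             for xi, ph, sh in zip(x, p_hat, s_hat)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
        if dot(r, r) ** 0.5 <= tol * b_norm:             # convergence test (extra norm)
            return x, it
    return x, max_iter


# Small nonsymmetric example with a diagonal (Jacobi) preconditioner.
A = [[4.0, 1.0, 0.0],
     [2.0, 5.0, 1.0],
     [0.0, 1.0, 3.0]]
b = [6.0, 15.0, 11.0]           # equals A times (1, 2, 3)
x, iters = bicgstab(A, b, lambda v: [vi / A[i][i] for i, vi in enumerate(v)])
```

Passing the preconditioner as a function keeps the solver independent of how the preconditioner is built, so a Block-ILU or SPAI application routine could be supplied in place of the diagonal scaling used here.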


3  Preconditioners 

The rate at which an iterative method converges depends greatly on the spectrum
of the coefficient matrix. Hence iterative methods usually involve a second matrix
that transforms the coefficient matrix into one with a more favourable spectrum.
A preconditioner is a matrix that effects such a transformation.

We consider a linear system of equations of the form

$$Ax = b, \qquad x, b \in \mathbb{R}^{n} \tag{2}$$

where A is a large, sparse, nonsymmetric matrix.

If a matrix M approximates the coefficient matrix A in some way, we can
transform the original system as follows (right preconditioning):

$$Ax = b \;\rightarrow\; A M^{-1} (M x) = b \tag{3}$$

Similarly, left preconditioning can be defined by:

$$Ax = b \;\rightarrow\; M^{-1} A x = M^{-1} b \tag{4}$$

Another way of deriving the preconditioned conjugate gradients method
would be to split the preconditioner as $M = M_1 M_2$, where the matrices $M_1$
and $M_2$ are called the left and right preconditioners, and to transform the
system as

$$Ax = b \;\rightarrow\; M_1^{-1} A M_2^{-1} (M_2 x) = M_1^{-1} b \tag{5}$$

In this section we present a parallel version of two preconditioners. The first
one is based on a partially decoupled block form of the ILU. We call it
Block-ILU(fill, $\tau$, overlap), because it permits the control of both the
block fill and the block overlap. The second one is based on the SPAI (SParse
Approximate Inverse) method. Both methods are analysed and compared to the ILU
preconditioner, using Bi-CGSTAB to solve general sparse, nonsymmetric systems.

3.1  Parallel  Block-ILU  preconditioner 

In this section we present a new version of a preconditioner based on a
partially decoupled block form of the ILU [2]. This new version, called
Block-ILU(fill, $\tau$, overlap), permits the control of its effectiveness
through a dropping parameter $\tau$ and a block fill-in parameter. Moreover, it
permits the control of the overlap between the blocks. We have verified that the
fill-in control is very important for getting the most out of this
preconditioner. The original matrix is subdivided into a number of overlapping
blocks, and each block is assigned to a processor. This setup produces the
partitioning represented in Figure 1 for the case of 4 processors, where the ILU
factorization of all the blocks is computed in parallel, obtaining
$A_i = L_i U_i$, $1 \le i \le p$, where p is the number of blocks. Due to the
characteristics of this preconditioner, there is a certain loss of information.
This means that the number of iterations will increase as the number of blocks
increases (as a direct consequence of increasing the number of processors). This
loss can be compensated to a certain extent by the information provided by the
overlapping zones.

To create the preconditioner, the rows of each block indicated by the overlap
parameter are exchanged between the processors. These rows correspond to
regions A and C of Figure 2. Afterwards, the factorization is carried out.
Within the loop of the solution algorithm it is necessary to carry out the
preconditioning operation

$$L_i U_i v = w \tag{6}$$

To reduce the number of operations of the algorithm, each processor works only
with its local rows. The first operation is to extend the information of vector
w to the neighbouring processors. Then each processor solves the lower and upper
triangular systems to calculate vector v. As regions A and C have also been
calculated by other processors, the values obtained will vary between
processors. In order to avoid this and improve the convergence of the algorithm,
it is necessary to exchange these data and calculate the average of these
values.
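The setup and application steps just described can be emulated serially as a rough illustration. The sketch below makes several assumptions: the matrix is stored densely, each block is factored with a plain ILU(0) (no fill and no dropping threshold, so neither the fill nor the $\tau$ parameter is modelled), and the loop over blocks stands in for the processors. The helper names `ilu0`, `lu_solve`, `block_ranges`, `setup_block_ilu` and `apply_block_ilu` are hypothetical.

```python
def ilu0(A):
    """ILU(0) of a dense-stored matrix: elimination restricted to the nonzero
    pattern of A. L (unit diagonal) and U are returned packed in one matrix."""
    n = len(A)
    LU = [list(row) for row in A]
    for i in range(1, n):
        for k in range(i):
            if A[i][k] != 0.0:
                LU[i][k] /= LU[k][k]
                for j in range(k + 1, n):
                    if A[i][j] != 0.0:
                        LU[i][j] -= LU[i][k] * LU[k][j]
    return LU


def lu_solve(LU, w):
    """Solve L U v = w by forward and then backward substitution."""
    n = len(w)
    y = [0.0] * n
    for i in range(n):
        y[i] = w[i] - sum(LU[i][j] * y[j] for j in range(i))
    v = [0.0] * n
    for i in range(n - 1, -1, -1):
        v[i] = (y[i] - sum(LU[i][j] * v[j] for j in range(i + 1, n))) / LU[i][i]
    return v


def block_ranges(n, p, overlap):
    """Row range [lo, hi) of each of the p blocks, widened by `overlap` rows."""
    size = (n + p - 1) // p
    return [(max(0, i * size - overlap), min(n, (i + 1) * size + overlap))
            for i in range(p)]


def setup_block_ilu(A, p, overlap):
    """Factor every overlapping diagonal block independently (one per processor)."""
    factors = []
    for lo, hi in block_ranges(len(A), p, overlap):
        A_i = [row[lo:hi] for row in A[lo:hi]]
        factors.append((lo, hi, ilu0(A_i)))
    return factors


def apply_block_ilu(factors, w):
    """Solve L_i U_i v_i = w_i on every block, then average the overlapped entries."""
    n = len(w)
    total = [0.0] * n
    count = [0] * n
    for lo, hi, LU in factors:
        v_i = lu_solve(LU, w[lo:hi])
        for k in range(lo, hi):
            total[k] += v_i[k - lo]
            count[k] += 1
    return [t / c for t, c in zip(total, count)]


# Tridiagonal test matrix; w is chosen so that the exact solution is all ones.
n = 6
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
w = [sum(row) for row in A]
v_one_block = apply_block_ilu(setup_block_ilu(A, 1, 0), w)
v_two_blocks = apply_block_ilu(setup_block_ilu(A, 2, 1), w)
```

With a single block the tridiagonal example is solved exactly, since ILU(0) produces no dropped entries for this pattern. With two blocks the result is only an approximation, which illustrates the loss of information mentioned above.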

The main advantage of this method is that it presents a very efficient parallel
execution, because it avoids the data dependence of sequential ILU, thereby
obtaining high performance and scalability. A disadvantage is that it is less
robust than complete ILU, due to the loss of information, and this can be a
problem in very badly conditioned systems, as we will show in Section 4.


3.2  Parallel  SPAI  preconditioner 

One of the main drawbacks of the ILU preconditioner is the low parallelism it
implies. A natural way to achieve parallelism is to compute an approximate
inverse M of A, such that $MA \approx I$ in some sense. A simple technique for
finding approximate inverses of arbitrary sparse matrices is to attempt to find
a sparse matrix M which minimizes the Frobenius norm of the residual matrix
AM - I,

$$F(M) = \|AM - I\|_F^2 \tag{7}$$
A matrix M whose value F(M) is small would be a right-approximate inverse
of A. Similarly, a left-approximate inverse can be defined by using the objective
function

$$\|MA - I\|_F^2 \tag{8}$$

These cases are very similar. The objective function (7) decouples into the sum
of the squares of the 2-norms of the individual columns of the residual matrix
AM - I,

$$F(M) = \|AM - I\|_F^2 = \sum_{j=1}^{n} \|A m_j - e_j\|_2^2 \tag{9}$$

in which $e_j$ and $m_j$ are the j-th columns of the identity matrix and of the
matrix M. There are two different ways to proceed in order to minimize (9). The
first one consists in minimizing it globally as a function of the matrix M,
e.g., by a gradient-type method. Alternatively, in the second way the individual
functions

$$f_j(m_j) = \|A m_j - e_j\|_2^2, \qquad j = 1, \ldots, n \tag{10}$$

can be minimized. This second approach is attractive for parallel computers,
and it is the one we have used in this paper. A good, inherently parallel solution
is to compute the columns $m_k$ of M independently of each other, resulting in:

$$\|AM - I\|_F^2 = \sum_{k=1}^{n} \|(AM - I) e_k\|_2^2 \tag{11}$$

The solution of (11) can be organized into n independent systems,

$$\min_{m_k} \|A m_k - e_k\|_2, \qquad k = 1, \ldots, n, \qquad e_k = (0, \ldots, 0, 1, 0, \ldots, 0)^T \tag{12}$$

We have to solve n systems of equations. If these linear systems were solved
without taking advantage of sparsity, the cost of constructing the preconditioner
would be of order $n^2$. This is because each of the n columns would require
O(n) operations. Such a cost would become unacceptable for large linear systems.
To avoid this, the iterations must be performed in sparse-sparse mode. As A is
sparse, we can work with systems of much lower dimension. Let L(k) be the set
of indices j such that $m_k(j) \neq 0$. We denote the reduced vector of unknowns
$m_k(L)$ by $\hat{m}_k$ and the resulting submatrix A(L, L) by $\hat{A}$.
Similarly, we define $\hat{e}_k = e_k(L)$. Now, solving (12) is transformed into
solving:

$$\min_{\hat{m}_k} \|\hat{A} \hat{m}_k - \hat{e}_k\|_2 \tag{13}$$

Due to the sparsity of A and M, the dimension of the systems (13) is very
small. To solve these systems we have chosen direct methods rather than an
iterative one, mainly because the systems (13) are very small and almost dense.
Of the different alternatives we have concentrated on the QR and LU methods
[17].
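The column-by-column construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The index set L(k) is taken to be the indices reachable from k in at most `level` steps of the graph of A, which is only one plausible reading of how a neighbour count selects the pattern. Each reduced system is treated as the square system with matrix A(L, L) and solved by dense Gaussian elimination with partial pivoting, and the loop over columns stands in for the parallel loop. The names `solve_dense`, `neighbours`, `spai` and `residual_norm_sq` are hypothetical.

```python
def solve_dense(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def neighbours(A, k, level):
    """Indices reachable from k in at most `level` steps in the graph of A."""
    n = len(A)
    idx = {k}
    for _ in range(level):
        idx |= {j for i in idx for j in range(n)
                if A[i][j] != 0.0 or A[j][i] != 0.0}
    return sorted(idx)


def spai(A, level):
    """Approximate inverse built one column at a time from reduced systems."""
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for k in range(n):                       # columns are mutually independent
        L = neighbours(A, k, level)
        A_hat = [[A[i][j] for j in L] for i in L]
        e_hat = [1.0 if i == k else 0.0 for i in L]
        m_hat = solve_dense(A_hat, e_hat)
        for j, mj in zip(L, m_hat):
            M[j][k] = mj
    return M


def residual_norm_sq(A, M):
    """Squared Frobenius norm of A M - I."""
    n = len(A)
    return sum((sum(A[i][j] * M[j][k] for j in range(n))
                - (1.0 if i == k else 0.0)) ** 2
               for i in range(n) for k in range(n))


# Nonsymmetric tridiagonal example: diagonal 4, superdiagonal -1, subdiagonal -2.
n = 5
A = [[4.0 if j == i else (-1.0 if j == i + 1 else (-2.0 if j == i - 1 else 0.0))
      for j in range(n)] for i in range(n)]
r0 = residual_norm_sq(A, spai(A, 0))
r1 = residual_norm_sq(A, spai(A, 1))
r4 = residual_norm_sq(A, spai(A, 4))
```

In this small example the residual shrinks as the pattern grows, and with `level = 4` every pattern covers all indices, so M equals the inverse up to rounding. A larger pattern gives a more accurate preconditioner at the price of larger reduced systems.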

The QR factorization of a matrix $A \in \mathbb{R}^{m \times n}$ is given by
A = QR, where R is an m-by-n upper triangular matrix and Q is an m-by-m unitary
matrix. This factorization is better than LU because it can be used for
non-square matrices, and it also works in some cases in which LU fails due to
problems with too small pivots [10]. The cost of this factorization is
$O(\tfrac{4}{3} n^3)$. The other direct method we have tested is LU. This
factorization and the closely related Gaussian elimination algorithm are widely
used in the solution of linear systems of equations. LU factorization expresses
the coefficient matrix, A, as the product of a lower triangular matrix, L, and
an upper triangular matrix, U. After factorization, the original system of
equations can be written as a pair of triangular systems,

$$Ax = b \tag{14}$$

$$Ly = b, \qquad Ux = y \tag{15}$$

The first of the systems can be solved by forward reduction, and then back
substitution can be used to solve the second system to give x. The advantage of
this factorization is that its cost is $O(\tfrac{2}{3} n^3)$, lower than that of
QR. We have implemented both solvers in our code, specifically, the QR and the
LU decomposition with pivoting. An efficient implementation consists of
selecting the QR method if the matrix is not square. If it is square, we solve
the system by using LU, as this is faster than QR. Moreover, there is also the
possibility of using QR if some error is produced in the construction of the LU
factorization.

In the next section we compare the results we have obtained with these
methods. In this code the SPAI parameter L indicates the number of neighbours
of each point we use to reduce the system. The main drawback of preconditioners
based on the SPAI idea is that they need more computations than the rest. So,
in the simplest situations and when the number of processors is small, they may
be slower than ILU-based preconditioners.

4  Numerical  experiments 

4.1  Test  problem  specification 

The matrices we have tested come from the simulation of heterojunction bipolar
transistors [9,11]. These matrices are highly sparse, nonsymmetric and, in
general, not diagonally dominant. They were obtained by applying the finite
element method to heterojunction bipolar devices, specifically to InP/InGaAs
transistors [6].

The basic equations of semiconductor devices are Poisson's equation and the
electron and hole continuity equations, in the stationary state:

$$\mathrm{div}(\varepsilon \nabla \psi) = q\,(p - n + N_D^+ - N_A^-) \tag{16}$$

$$\mathrm{div}(J_n) = qR \tag{17}$$

$$\mathrm{div}(J_p) = -qR \tag{18}$$

where $\psi$ is the electrostatic potential, q is the electronic charge,
$\varepsilon$ is the dielectric constant of the material, n and p are the
electron and hole densities, $N_D^+$ and $N_A^-$ are the effective doping
concentrations, and $J_n$ and $J_p$ are the electron and hole current densities,
respectively. The term R represents the volume recombination term, taking into
account Shockley-Read-Hall, Auger and band-to-band recombination mechanisms
[20].

Table 1. Time (sec) for Block-ILU

Proc                 2      4      6      8      10
FILL=0/Overlap=1     0.86   0.47   0.29   0.21   0.18
FILL=0/Overlap=3     0.83   0.41   0.29   0.22   0.18
FILL=0/Overlap=6     0.83   0.43   0.29   0.21   0.17
FILL=2/Overlap=1     0.38   0.18   0.12   0.094  0.077
FILL=4/Overlap=1     0.38   0.19   0.12   0.095  0.077

For this type of semiconductor it is usual to apply first a Gummel-type
solution method [16], which uncouples the three equations and allows us to
obtain an initial solution for the coupled system of three equations. For the
semiconductors we use, we have to solve the three equations simultaneously. The
pattern of these matrices is similar to that of matrices in other fields, such
as CFD applications [3,4].

We have distributed the matrix by rows and have obtained an optimum
distribution of the work load among the processors.

All the results have been obtained on a CRAY T3E multicomputer [18]. We
have programmed it using the SPMD paradigm, with the MPI library, and we
have obtained results with several matrices of different characteristics.


4.2  Parallel  Block-ILU  preconditioner 

We have carried out different tests to study how the fill-in and overlap
parameters affect the calculation time and speedup for the solution of a system
of equations. In Tables 1 and 2 we show the execution times and speedup
for a badly conditioned matrix with N = 25000. Time is measured from two
processors onwards, because we have memory problems trying to run the code
on a single processor. So the speedup is computed as:


$$\mathrm{speedup}_p = \frac{T_2}{T_p}$$

where $T_2$ is the execution time with two processors and $T_p$ is the execution
time with p processors.
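As a quick check of this definition, the snippet below recomputes the speedup from the rounded timings of the FILL=0/Overlap=1 row of Table 1, taking the two-processor run as the baseline. The function name `speedup_vs_two` is illustrative. The results can differ from Table 2 in the last digit, presumably because the published speedups were derived from unrounded timings.

```python
def speedup_vs_two(times):
    """Speedup relative to the two-processor run: T_2 / T_p."""
    return {p: times[2] / t for p, t in times.items()}


# Rounded timings (seconds) of the FILL=0/Overlap=1 row of Table 1.
times = {2: 0.86, 4: 0.47, 6: 0.29, 8: 0.21, 10: 0.18}
speedups = speedup_vs_two(times)
for p in sorted(speedups):
    print(p, round(speedups[p], 2))
```

With this baseline the ideal speedup on p processors is p/2, so the value obtained on 10 processors should be compared with 5 rather than 10.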

With respect to the results shown in Table 1, note that if we keep the value
of the fill-in constant, the execution time hardly varies when the overlap is
increased. This is because the only variation is in the size of the message to
be transmitted, whereas the size of the overlap zone is very small in comparison
to the total. Therefore the increase in the computations is small. However, if
we keep the value of the overlap constant and increase the fill-in, a
significant variation is observed. This is because the number of iterations
decreases considerably.
Table 2. Speedup for Block-ILU

Proc                 2      4      6      8      10
FILL=0/Overlap=1     1.0    1.82   2.96   4.09   4.77
FILL=0/Overlap=3     1.0    2.02   2.86   3.77   4.66
FILL=0/Overlap=6     1.0    1.91   2.84   3.92   4.53
FILL=2/Overlap=1     1.0    2.10   3.16   4.04   4.93
FILL=4/Overlap=1     1.0    2.00   3.16   4.0    4.93

Table 3. Time (sec) for SPAI with LU

Proc     2      4      6      8      10     Iter.
L=0      2.19   1.10   0.76   0.54   0.46   47
L=1      1.93   0.97   0.66   0.48   0.41   35
L=2      1.51   0.86   0.52   0.38   0.32   22
L=3      1.41   0.73   0.49   0.35   0.32   17
L=4      1.46   0.73   0.49   0.37   0.31   13
L=5      1.60   0.79   0.51   0.39   0.33   11

As regards the speedup values in Table 2, the values obtained are good in all
cases, and the algorithm obtains slightly better results when the level of
fill-in is increased for a constant level of overlap. However, for a constant
level of fill-in the speedup decreases very smoothly as the level of overlap
increases. This is because it is necessary to carry out a larger number of
operations and the cost of communications is also a little higher. From the
results obtained it is possible to conclude that the best option is to choose
the lowest value of overlap with which we can assure convergence, together with
an average value of fill-in.


4.3  Parallel  SPAI  preconditioner 

First we compare the results we have obtained with the two direct solvers
described in Section 3.2. For a badly conditioned system with N = 25000 we have
obtained the results shown in Figure 3. These data refer to the cost of
generating the matrix for each node with an overlap level of 1. In this case the
resulting submatrices are of rank 3. Note that the cost of the QR factorization
is significantly higher than that of LU. This difference is much larger for
higher values of the overlap level.

Table 3 shows the time used to solve a badly conditioned system with
N = 25000, as well as the number of iterations of the Bi-CGSTAB solver. Note
that as the value of parameter L increases, the number of iterations decreases
because the preconditioner is more exact. As regards speedup, in all cases
values close to the optimum are obtained, and in some cases the optimum is even
surpassed because of superlinear effects. For this class of matrices the optimum
value of parameter L would be 3 or 4. From the rest of the results we can
conclude that the more diagonally dominant the matrix, the smaller the optimum
value of this parameter and, conversely, for worse conditioned matrices we will
need higher values of L to assure convergence.

Fig. 3. QR versus LU on the CRAY T3E with N = 25000 (badly conditioned system):
(a) time versus number of processors.

4.4  Parallel  Block-ILU  versus  Parallel  SPAI 

In order to test the effectiveness of the parallel implementations of Block-ILU
and SPAI, we have compared them to a parallel version of the ILU(fill, $\tau$)
preconditioner.

In Figure 4, results are shown for the complete solution of a system of
equations with N = 25000, where matrix A is well conditioned (diagonally
dominant). Again time is measured from two processors onwards, because we have
memory problems trying to run the code on a single processor. It can be seen in
Figure 4(a) that the parallel SPAI method obtains the best speedup, and that
parallel Block-ILU(0,0,1) obtains very similar results. However, the ILU(0,0)
preconditioner obtains very bad results. This is because of the bottleneck
implied in the solution of the upper and lower sparse triangular systems. On the
other hand, parallel SPAI is slower (Figure 4(b)) when the number of processors
is small, because of the high number of operations it implies.

In Figure 5, results are shown for a matrix with N = 25000, corresponding
to a poorly conditioned system. Again (Figure 5(a)) parallel SPAI and
Block-ILU(0,0,1) obtain very similar speedup results. The ILU(0,0)
preconditioner obtains the worst results. And again, parallel SPAI is the
slowest solution when the number of processors is small (Figure 5(b)).

From the point of view of scalability, parallel Block-ILU is worse than parallel
SPAI. This is due to the fact that Block-ILU suffers a loss of information with
respect to the sequential algorithm when the number of processors increases.
This means that, with some matrices, the number of iterations, and therefore
the total time for Bi-CGSTAB to converge, grows when the number of processors
increases, thereby degrading the effectiveness of the preconditioner.

Fig. 4. Results on the CRAY T3E with N = 25000 (well conditioned system):
(a) speedup versus number of processors; (b) time versus number of processors.

Figure 6 shows the results for a non diagonally dominant and very badly
conditioned matrix with N = 25000. In this case, the system converges with the
three preconditioners, but a significant difference is noted between SPAI and
the incomplete factorizations. Note that with the SPAI preconditioner we obtain
a nearly ideal value of speedup, whereas in the other cases it hardly reaches 1,
irrespective of the number of processors. However, if we examine the time
measurements, it can be established that the fastest preconditioner is ILU(3,0),
together with Block-ILU(3,0,3), although this time hardly varies with different
numbers of processors. On the other hand, SPAI is much slower than the other
two. The reason for this behaviour is that, on the one hand, Block-ILU
considerably increases the number of iterations as the number of processors is
increased, due to the loss of information that this method implies. This
increase offsets the reduction in the cost per iteration, which means that the
speedup does not increase. On the other hand, to guarantee convergence we must
use SPAI with high values of L, which implies a high cost for each iteration.
However, the number of iterations does not grow as the number of processors
increases, and thereby we obtain a high level of speedup. With a large number
of processors, parallel SPAI would probably outperform ILU-based
preconditioners.

Fig. 5. Results on the CRAY T3E with N = 25000 (badly conditioned system):
(a) speedup versus number of processors; (b) time versus number of processors.

5  Conclusions 

The choice of the best preconditioner is conditioned by the characteristics of
the system we have to solve. When it is not a very badly conditioned system,
parallel Block-ILU appears to be the best solution, because of both the high
level of speedup it achieves and the reduced time it requires to obtain the
final solution. The parallel SPAI preconditioner obtains very good results in
scalability, so it could be the best choice when the number of processors grows.
Moreover, we have verified that it achieves convergence in some situations where
ILU-based preconditioners fail. Finally, the direct parallel implementations of
ILU obtain very poor results.

Fig. 6. Results on the CRAY T3E with N = 25000 (very badly conditioned system):
(a) speedup versus number of processors; (b) time versus number of processors.
Abstract. Parallel preconditioned solvers are presented to compute a
few extreme eigenvalues and -vectors of large sparse Hermitian matrices,
based on the Jacobi-Davidson (JD) method by G.L.G. Sleijpen and
H.A. van der Vorst. For preconditioning, an adaptive approach is applied
using the QMR (Quasi-Minimal Residual) iteration. Special QMR
versions have been developed for the real symmetric and the complex
Hermitian case. To parallelize the solvers, matrix and vector partitioning
is investigated with a data distribution and a communication scheme
exploiting the sparsity of the matrix. Synchronization overhead is reduced
by grouping inner products and norm computations within the QMR
and the JD iteration. The efficiency of these strategies is demonstrated
on the massively parallel systems NEC Cenju-3 and Cray T3E.


1  Introduction 

The  simulation  of  quantum  chemistry  and  structural  mechanics  problems  is  a 
source  of  computationally  challenging,  large  sparse  real  symmetric  or  complex 
Hermitian  eigenvalue  problems.  For  the  solution  of  such  problems,  parallel  pre¬ 
conditioned  solvers  are  presented  to  determine  a  few  eigenvalues  and  -vectors 
based  on  the  Jacobi-Davidson  (JD)  method  [9]. 

For preconditioning, an adaptive approach using the QMR (Quasi-Minimal
Residual) iteration [2,5,7] is applied, i.e., the preconditioning system of linear
equations within the JD iteration is solved iteratively and adaptively by checking
the residual norm within the QMR iteration [3,4]. Special QMR versions have
been developed for the real symmetric and the complex Hermitian case.

The matrices A considered are generalized sparse, i.e., the computation of
a matrix-vector multiplication $A \cdot v$ takes considerably less than $n^2$
operations. This covers ordinary sparse matrices as well as dense matrices from
quantum chemistry built up additively from a diagonal matrix, a few outer
products, and an FFT. In order to exploit the advantages of such structures with
respect to operational complexity and memory requirements when solving systems
of linear equations or eigenvalue problems, it is natural to apply iterative
methods.

To  parallelize  the  solvers,  matrix  and  vector  partitioning  is  investigated  with 
a  data  distribution  and  a  communication  scheme  exploiting  the  sparsity  of  the 
matrix. Synchronization overhead is reduced by grouping inner products and
norm  computations  within  the  QMR  and  the  JD  iteration.  Moreover,  in  the 
complex  Hermitian  case,  communication  coupling  of  QMR’s  two  independent 
matrix-vector  multiplications  is  investigated. 

2  Jacobi-Davidson  Method 

To  solve  large  sparse  Hermitian  eigenvalue  problems  numerically,  variants  of  a 
method  proposed  by  Davidson  [8]  are  frequently  applied.  These  solvers  use  a 
succession  of  subspaces  where  the  update  of  the  subspace  exploits  approximate 
inverses of the problem matrix, $A$. For $A$, $A = A^T$ or $A^* = A^T$ holds, where $A^*$
denotes $A$ with complex conjugate elements and $A^H = (A^T)^*$ (transposed and
complex conjugate).

The basic idea is: Let $V^k$ be a subspace of $\mathbb{R}^n$ with an orthonormal basis
$w_1^k, \ldots, w_m^k$ and $W$ the matrix with columns $w_j^k$, $S := W^H A W$, $\lambda_j^k$ the
eigenvalues of $S$, and $T$ a matrix with the eigenvectors of $S$ as columns. The
columns $x_j^k$ of $WT$ are approximations to eigenvectors of $A$ with Ritz values
$\lambda_j^k = (x_j^k)^H A x_j^k$ that approximate eigenvalues of $A$. Let us assume that
$\lambda_{j_s}^k, \ldots, \lambda_{j_s+l-1}^k \in [\lambda_{\mathrm{lower}}, \lambda_{\mathrm{upper}}]$. For $j \in \{j_s, \ldots, j_s+l-1\}$ define

$q_j^k = (A - \lambda_j^k I)\, x_j^k, \qquad r_j^k = (\bar{A} - \lambda_j^k I)^{-1} q_j^k, \qquad (1)$

and $V^{k+1} = \mathrm{span}(V^k \cup r_{j_s}^k \cup \ldots \cup r_{j_s+l-1}^k)$ where $\bar{A}$ is an easy to invert
approximation to $A$ ($\bar{A} = \mathrm{diag}(A)$ in [8]). Then $V^{k+1}$ is an $(m+l)$-dimensional subspace
of $\mathbb{R}^n$, and the repetition of the procedure above gives in general improved
approximations to eigenvalues and -vectors. Restarting may increase efficiency.

For good convergence, $V^k$ has to contain crude approximations to all
eigenvectors of $A$ with eigenvalues smaller than $\lambda_{\mathrm{lower}}$ [8]. The approximate inverse
must not be too accurate, otherwise the method stalls. The reason for this was
investigated in [9] and leads to the Jacobi-Davidson (JD) method with an
improved definition of $r_j^k$:

$[(I - x_j^k (x_j^k)^H)\,(\bar{A} - \lambda_j^k I)\,(I - x_j^k (x_j^k)^H)]\; r_j^k = q_j^k \qquad (2)$

The projection $(I - x_j^k (x_j^k)^H)$ in (2) is not easy to incorporate into the matrix,
but there is no need to do so, and solving (2) is only slightly more expensive
than solving (1).

The method converges quadratically for $\bar{A} = A$.

3  Preconditioning 

The character of the JD method is determined by the approximation $\bar{A}$ to $A$. For
obtaining an approximate solution of the preconditioning system (2), we may
try an iterative approach [3,4,9]. Here, a real symmetric or a complex Hermitian
version of the QMR algorithm is used [2,5,7] that is directly applied
to the projected system (2) with $\bar{A} = A$. The control of the QMR iteration is
as  follows.  Iteration  is  stopped  when  the  current  residual  norm  is  smaller  than 
the  residual  norm  of  QMR  in  the  previous  inner  JD  iteration.  By  controlling  the 
QMR  residual  norms,  we  achieve  that  the  preconditioning  system  (2)  is  solved  in 
low  accuracy  in  the  beginning  and  in  increasing  accuracy  in  the  course  of  the  JD 
iteration.  For  a  block  version  of  JD,  the  residual  norms  of  each  preconditioning 
system  (2)  are  separately  controlled  for  each  eigenvector  to  approximate  since 
some  eigenvector  approximations  are  more  difficult  to  obtain  than  others.  This 
adapts the control to the properties of the matrix's spectrum.

Algorithm 1 shows the QMR iteration used to precondition JD for complex
Hermitian matrices. The method is derived from the QMR variant described in
[5]. Within JD, the matrix $B$ in Algorithm 1 corresponds to the matrix
$[(I - x_j^k (x_j^k)^H)\,(A - \lambda_j^k I)\,(I - x_j^k (x_j^k)^H)]$ of the preconditioning system (2).

Per QMR iteration, two matrix-vector operations with $B$ and $B^*$ (marked by
frames in Algorithm 1) are performed since QMR is based on the non-Hermitian
Lanczos algorithm that requires operations with $B$ and $B^T$ but not with
$B^H$ [7]. For real symmetric problems, only one matrix-vector operation per QMR
iteration is necessary since then $q^i = B p^i$ and thus $v^{i+1} = q^i - (\tau^i / \gamma^i)\, v^i$ hold.
The only matrix-vector multiplication to compute per iteration is then $B w^{i+1}$.
Naturally, $B$ is not computed element-wise from
$[(I - x_j^k (x_j^k)^H)\,(A - \lambda_j^k I)\,(I - x_j^k (x_j^k)^H)]$; the operation $B p^i$, e.g., is split
into vector-vector operations and one matrix-vector operation with $A$.
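As an illustration of how the projected operator can be applied without ever forming it, the following minimal sketch computes $(I - x x^H)(A - \lambda I)(I - x x^H)\,p$ with two inner products, two vector updates, and a single product with $A$. The function name, the callable `matvec_A`, and the small test matrix are assumptions made for this example; they are not taken from the authors' code.

```python
def apply_projected_operator(matvec_A, x, lam, p):
    """Return (I - x x^H)(A - lam*I)(I - x x^H) p for a unit vector x.

    matvec_A is any callable returning A*v as a list (hypothetical interface).
    """
    def dot_h(u, v):
        # Inner product u^H v (conjugate on the first argument).
        return sum(ui.conjugate() * vi for ui, vi in zip(u, v))

    # First projection: t = p - x (x^H p)   (vector-vector operations only)
    alpha = dot_h(x, p)
    t = [pi - alpha * xi for pi, xi in zip(p, x)]

    # Shifted product: s = A t - lam t      (the single product with A)
    At = matvec_A(t)
    s = [ai - lam * ti for ai, ti in zip(At, t)]

    # Second projection: result = s - x (x^H s)
    beta = dot_h(x, s)
    return [si - beta * xi for si, xi in zip(s, x)]


# Small Hermitian example (made-up data).
A = [[2, 1j], [-1j, 3]]

def matvec(v):
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]

result = apply_projected_operator(matvec, [1, 0], 0.5, [1 + 1j, 2])
```

The result is orthogonal to `x` by construction, which is the property the projection in (2) enforces.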

Note  that  the  framed  matrix-vector  operations  in  the  complex  Hermitian 
QMR iteration are independent of each other. This can be exploited for a
parallel  implementation  (see  5.2).  Moreover,  all  vector  reductions  in  Algorithm  1 
(marked  by  bullets)  are  grouped.  This  in  addition  makes  the  QMR  variant  well 
suited  for  a  parallel  implementation  (see  5.3). 


4  Storage  scheme 

Efficient  storage  schemes  for  large  sparse  matrices  depend  on  the  sparsity  pattern 
of the matrix, the considered algorithm, and the architecture of the computer
system  used  [1].  Here,  the  CRS  format  (Compressed  Row  Storage)  is  applied. 
This  format  is  often  used  in  FE  programs  and  is  suited  for  matrices  with  regular 
as well as irregular structure. The principle of the scheme is illustrated in Fig. 1.
Algorithm 1. Complex Hermitian QMR

p°  =  5°  =  0,  =  1,  =  -1,  U-'  =v^  =  r°  =b-  Bx° 

V  =|lr'||,  ^‘=7',  =  ^ 

i  =  1.2,... 


1  1  i  i  )  —  t 

P  -  ~  F  V 


i  1  o.  .  ,-i 

q  =  -B  w  -  — f/ 


Bp' 


T  , 
- rr 

7^ 


j  -f  1  i  ^  t 

w  =  q  —  ~w 


*/ ( Ik’  *11  <  tolerance)  then  STOP 
=  |K+‘|| 


p'+'  =  (u'‘+*)^k+' 


=  ( 


B'w' 


, ,.+1  _ 


^t  +  l  pi 
^t  +  l 


t+1  i+1 

-  7  M 


kf  (i-k) 

;>|r'|2  +  |7'  +  i|2 

v^T'\'^  +  |7'+M^ 


j/'|r'P  + 

d'  =  e'd'-^  +  k'p' 
s'  =  9's'~^  +  n'  Bp' 
Fig. 1. CRS storage scheme


The non-zeros of matrix A are stored row-wise in three one-dimensional
arrays. value contains the values of the non-zeros, col_ind the corresponding
column indices. The elements of row_ptr point to the position of the beginning
of each row in value and col_ind.
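A minimal sketch of the three CRS arrays and of the row-wise matrix-vector product they support is given below. It assumes 0-based indices and a `row_ptr` array with one extra trailing entry marking the end of the last row; both conventions, as well as the function names and the sample matrix, are choices made for this example rather than details taken from the paper.

```python
def dense_to_crs(rows):
    """Build value, col_ind and row_ptr from a dense list-of-lists matrix."""
    value, col_ind, row_ptr = [], [], [0]
    for row in rows:
        for j, a in enumerate(row):
            if a != 0:
                value.append(a)      # non-zero value
                col_ind.append(j)    # its column index
        row_ptr.append(len(value))   # start of the next row
    return value, col_ind, row_ptr


def crs_matvec(value, col_ind, row_ptr, x):
    """Compute y = A x row by row from the CRS arrays."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += value[k] * x[col_ind[k]]
        y.append(acc)
    return y


A = [[4, 0, 1],
     [0, 3, 0],
     [2, 0, 5]]
value, col_ind, row_ptr = dense_to_crs(A)
y = crs_matvec(value, col_ind, row_ptr, [1, 2, 3])
```

Only the stored non-zeros are touched, so the cost of the product grows with the number of non-zeros rather than with $n^2$.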


5  Parallelization  Strategies 


5.1  Data  Distribution 


The data distribution scheme considered here balances both matrix-vector and
vector-vector operations for irregularly structured sparse matrices on distributed
memory systems (see also [2]). The scheme results in a row-wise distribution of
the matrix arrays value and col_ind (see 4); the rows of each processor succeed
one another. The distribution of the vector arrays corresponds component-wise
to the row distribution of the matrix arrays. In the following, $n_k$ denotes the
number of rows of processor $k$, $k = 0, \ldots, p-1$; $n$ is the total number. $g_k$ is
the index of the first row of processor $k$, and $z_i$ is the number of non-zeros
of row $i$. For these quantities, the following equations hold:

$n = \sum_{k=0}^{p-1} n_k, \qquad g_k = 1 + \sum_{i=0}^{k-1} n_i .$

In each iteration of an iterative method like JD or QMR, $s$ sparse matrix-vector
multiplications and $c$ vector-vector operations are performed. Scalar
operations are neglected here. With the data distribution considered, the load
generated by row $i$ is proportional to

$z_i \cdot s \cdot \zeta + c .$

The parameter $\zeta$ is hardware dependent since it considers the ratio of the costs
for a regular vector-vector operation and an irregular matrix-vector operation.
However, different matrix patterns could result in different memory access costs,
e.g., different caching behavior. Therefore, the parameter $\zeta$ is determined at
run-time by timings for a row block of the current matrix within the symmetric or
Hermitian QMR solver used. The measurement is performed once on one processor
with a predefined number of QMR iterations before the data are distributed.
By approximating $\zeta$ at run-time for the current matrix, the slight dependence
of $\zeta$ on the matrix pattern is considered in addition.

For computational load balance, each processor has to perform the $p$th fraction
of the total number of operations. Hence, the rows of the matrix and the
vector components are distributed according to (3).

$$
n_k =
\begin{cases}
\min\limits_{1 \le t \le n - g_k + 1} \left\{ t \;:\; \sum\limits_{i=g_k}^{t+g_k-1} (z_i\, s\, \zeta + c) \;\ge\; \dfrac{1}{p} \sum\limits_{i=1}^{n} (z_i\, s\, \zeta + c) \right\} & \text{for } k = 0, 1, \ldots, q \\[2ex]
n - \sum\limits_{i=0}^{q} n_i & \text{for } k = q + 1 \\[2ex]
0 & \text{for } k = q + 2, \ldots, p - 1
\end{cases}
\qquad (3)
$$

For large sparse matrices and $p < n$, usually $q = p - 1$ or $q + 1 = p - 1$ holds. It
should be noted that for $\zeta \to 0$ each processor gets nearly the same number of
rows and for $\zeta \to \infty$ nearly the same number of non-zeros.
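The row distribution criterion can be sketched as a greedy pass over the rows: each processor takes consecutive rows until its accumulated load reaches the $p$th fraction of the total load, and the first processor that cannot reach that fraction takes the remainder. The code below is one possible reading of that rule, not the authors' implementation. The non-zero counts of the first seven rows follow the col_ind entries of Fig. 2; the count for the last row is a made-up placeholder, since that part of the figure is not available.

```python
def distribute_rows(nnz_per_row, p, s, c, zeta):
    """Return the number of consecutive rows assigned to each of p processors.

    Row load is modelled as nnz * s * zeta + c (s matrix-vector products,
    c vector-vector operations, zeta = cost ratio).
    """
    load = [z * s * zeta + c for z in nnz_per_row]
    target = sum(load) / p            # p-th fraction of the total load
    n = len(load)
    counts, start = [], 0
    for k in range(p):
        if k == p - 1:
            # Last processor takes whatever is left (possibly nothing).
            counts.append(n - start)
            start = n
            continue
        acc, t = 0, 0
        # Smallest t whose accumulated load reaches the target,
        # or all remaining rows if the target cannot be reached.
        while start + t < n and acc < target:
            acc += load[start + t]
            t += 1
        counts.append(t)
        start += t
    return counts


# Rows 1-7 as in Fig. 2; the last value (2) is a hypothetical placeholder.
nnz = [1, 2, 3, 6, 3, 3, 4, 2]
rows_per_processor = distribute_rows(nnz, p=4, s=2, c=13, zeta=5)
```

With these inputs the sketch assigns 3, 2, 2, and 1 rows to the four processors, which agrees with the row blocks shown for processors 0 to 2 in Fig. 2.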

Fig. 2 illustrates the distribution of col_ind from Fig. 1 as well as the
distribution of the vectors x and y of the matrix-vector multiplication y = Ax to
four processors for $\zeta = 5$, $s = 2$, and $c = 13$.


               y            col_ind                  x
Processor 0:   |y1|y2|y3|   |1||3|2||4|2|1|          |x1|x2|x3|
Processor 1:   |y4|y5|      |3|4|8|6|7|5||4|5|7|     |x4|x5|
Processor 2:   |y6|y7|      |7|4|6||4|5|7|6|         |x6|x7|
Processor 3:   (entries not recoverable from the source)

Fig. 2. Data distribution for $\zeta = 5$, $s = 2$, and $c = 13$

In case of a heterogeneous computing environment, e.g., workstation clusters
with fast network connections or high-speed connected parallel computers, the
data distribution criterion (3) can easily be adapted to different per processor
performance or memory resources by predefining weights $\omega_k$ per processor $k$.
Only the fraction $1/p$ in (3) has then to be replaced by $\omega_k / \sum_{j=0}^{p-1} \omega_j$.


5.2  Communication  Scheme 

On a distributed memory system, the computation of the matrix-vector
multiplications requires communication because each processor owns only a partial
vector. For the efficient computation of the matrix-vector multiplications, it is
necessary to develop a suitable communication scheme (see also [2]). The goal
of the scheme is to enable the overlapped execution of computations and data
transfers to reduce waiting times based on a parallel matrix pattern analysis
and, subsequently, a block rearranging of the matrix data.

First, the arrays col_ind (see 4 and 5.1) are analyzed on each processor
k to determine which elements result in access to non-local data. Then, the
processors exchange information to decide which local data must be sent to
which processors. If the matrix-vector multiplications are performed row-wise,
components of the vector x of y = Ax are communicated. After the analysis,
col_ind and value are rearranged in such a way that the data that results in
access to processor h is collected in block h. The elements of block h succeed one
another row-wise with increasing column index per row. Block k is the first block
in the arrays col_ind and value of processor k. Its elements result in access to
local data; therefore, in the following, it is called the local block. The goal of this
rearranging is to perform computation and communication overlapped. Fig. 3
shows the rearranging for the array col_ind of processor 1 from Fig. 2.


                col_ind
Processor 1:    |3|4|8|6|7|5||4|5|7|

                col_ind
Reordering:     |4|5|4|5||3||6|7|7||8|
Block:           1        0  2      3

Fig. 3. Rearranging into blocks

The elements of block 1, the local block, result in access to the local
components 4 and 5 of x during the row-wise matrix-vector multiplication, whereas
operations with the elements of the blocks 0, 2, and 3 require communication
with the processors 0, 2, and 3, respectively. For parallel matrix-vector
multiplications, each processor first executes asynchronous receive-routines to receive
necessary non-local data. Then all components of x that are needed on other
processors are sent asynchronously. While the required data is on the network,
each processor k performs operations with block k. After that, as soon as
non-local data from processor h arrives, processor k continues the matrix-vector
multiplication by accessing the elements of block h. This is repeated until the
matrix-vector multiplication is complete. Computation and communication are
performed overlapped so that waiting times are reduced.
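The rearranging into blocks can be sketched for the col_ind array of processor 1. The code below groups the column indices by owning processor, keeps them row-wise with increasing column index per row inside each block, and places the local block first. The helper names and the `first_row` list (first row index of each processor, here 1, 4, 6, 8 as implied by the row counts 3, 2, 2 of processors 0 to 2) are assumptions of this example; a real implementation would permute the value array in the same way.

```python
def owner_of(col, first_row):
    """Processor that owns component `col` (1-based), given each
    processor's first row index in ascending order."""
    owner = 0
    for k, g in enumerate(first_row):
        if col >= g:
            owner = k
    return owner


def rearrange_into_blocks(rows_col_ind, first_row, me):
    """Return [(block_id, column indices)] with the local block first."""
    p = len(first_row)
    blocks = {h: [] for h in range(p)}
    for row in rows_col_ind:
        for col in sorted(row):              # increasing column index per row
            blocks[owner_of(col, first_row)].append(col)
    order = [me] + [h for h in range(p) if h != me]
    return [(h, blocks[h]) for h in order if blocks[h]]


# col_ind of processor 1, split into its two rows (rows 4 and 5).
rows = [[3, 4, 8, 6, 7, 5], [4, 5, 7]]
blocks = rearrange_into_blocks(rows, first_row=[1, 4, 6, 8], me=1)
```

Processing `blocks` in order then corresponds to working on the local block while the components owned by processors 0, 2, and 3 are still in transit.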

The block structure of the matrix data and the data structures for
communications have been optimized for both the real and the complex case to reduce
memory requirements and to save unnecessary operations. In addition, cache
exploitation is improved by these structures. All message buffers and the block
row pointers of the matrix structure are stored in a modified compressed row
format. Thus memory requirements per processor decrease almost proportionally
with increasing processor number even if the number of messages per processor
markedly rises due to a very irregular matrix pattern.

A parallel preanalysis phase to determine the sizes of all data structures
precedes the detailed communication analysis and the matrix rearranging. This
enables dynamic memory allocation and results in a further reduction of memory
requirements since memory not needed any more, e.g., after the analysis phases,
can be deallocated. Another advantage is that the same executable can be used
for problems of any structure and size.

For complex Hermitian problems, two independent matrix-vector products
with $B$ and $B^*$ have to be computed per QMR iteration (see the framed
operations in Algorithm 1). Communications for both operations (they possess the
same communication scheme) are coupled to reduce communication overhead
and waiting times.

The data distribution and the communication scheme presented here do not
require  any  knowledge  about  a  specific  discretization  mesh;  the  schemes  are 
determined  automatically  by  the  analysis  of  the  indices  of  the  non-zero  matrix 
elements. 


5.3  Synchronization 

Synchronization overhead is reduced by grouping inner products and norm
computations within the QMR and the JD iteration. For QMR in both the real
symmetric and the complex Hermitian case, special parallel variants based on
[5] have been developed that require only one synchronization point per
iteration step. For a parallel message passing implementation of Algorithm 1, all
local values of the vector reductions marked by bullets can be included into one
global communication to determine the global values.
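The grouping of reductions can be illustrated by a small simulation: every processor first computes all of its local partial sums, and a single global sum over the resulting short tuples replaces several separate reductions. The three reductions chosen here and all names are assumptions for illustration; in an MPI code the `global_sum` step would correspond to one all-reduce call on a short buffer.

```python
def local_partials(u, v, w):
    """Local contributions to u.v, u.w and ||w||^2 on one processor."""
    return (
        sum(a * b for a, b in zip(u, v)),
        sum(a * b for a, b in zip(u, w)),
        sum(a * a for a in w),
    )


def global_sum(partials_per_processor):
    """Stand-in for ONE global reduction over all processors."""
    return tuple(sum(column) for column in zip(*partials_per_processor))


def grouped_reductions(chunks):
    """chunks[k] = (u_k, v_k, w_k): the vector slices owned by processor k."""
    return global_sum([local_partials(u, v, w) for (u, v, w) in chunks])


# Two simulated processors holding slices of three distributed vectors.
chunks = [
    ([1, 2], [3, 4], [1, 0]),   # processor 0
    ([3],    [5],    [2]),      # processor 1
]
uv, uw, ww = grouped_reductions(chunks)
```

All three global values become available after one synchronization point instead of three.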

6  RESULTS 

All  parts  of  the  algorithms  have  been  investigated  with  various  application  prob¬ 
lems  on  the  massively  parallel  systems  NEC  Cenju-3  with  up  to  128  processors 
(64  Mbytes  main  memory  per  processor)  and  Cray  T3E  with  up  to  512  proces¬ 
sors  (128  Mbytes  main  memory  per  processor).  The  codes  have  been  written  in 
FORTRAN 77 and C; MPI is used for message passing.


6.1  Numerical  test  cases 

Numerical and performance tests of the JD implementation have been carried
out with the large sparse real symmetric matrices Episym1 to Episym6 and the
large sparse complex Hermitian matrices Epiherm1 and Epiherm2 stemming
from the simulation of electron/phonon interaction [10], with the real symmetric
matrices Struct1 to Struct3 from structural mechanics problems (finite element
discretization), and with the dense complex Hermitian matrix Thinfilms from
the simulation of thin films with defects. The smaller real symmetric test
matrices Laplace, GregCar, CulWil, and RNet originate from finite difference
discretization problems. Table 1 gives a survey of all matrices considered.


Table 1. Numerical data of the considered large sparse matrices

Matrix      Properties           Order         Number of non-zeros
Episym1     Real symmetric       98,800        966,254
Episym2     Real symmetric       126,126       1,823,812
Episym3     Real symmetric       342,200       3,394,614
Episym4     Real symmetric       1,009,008     14,770,746
Episym5     Real symmetric       5,513,508     81,477,386
Episym6     Real symmetric       11,639,628    172,688,506
Epiherm1    Complex Hermitian    126,126       1,823,812
Epiherm2    Complex Hermitian    1,009,008     14,770,746
Thinfilms   Complex Hermitian    1,413         1,996,569
Struct1     Real symmetric       835           13,317
Struct2     Real symmetric       2,839         299,991
Struct3     Real symmetric       25,222        3,856,386
Laplace     Real symmetric       900           7,744
GregCar     Real symmetric       1,000         2,998
CulWil      Real symmetric       1,000         3,996
RNet        Real symmetric       1,000         6,400


6.2  Effect  of  Preconditioning 

For the following investigation of the effect of QMR preconditioning on JD,
the  JD  iteration  was  stopped  if  the  residual  norms  divided  by  the  initial  norms 
are  less  than  10“°. 

In  Fig.  4,  times  for  computing  the  four  smallest  eigenvalues  and  -vectors  of 
the two real symmetric matrices Episym2 and Struct3 on 64 NEC Cenju-3
processors  are  compared  for  different  preconditioners. 

The best results are gained for JD with adaptive QMR preconditioning and
a few preceding, diagonally preconditioned outer JD steps (4 or 1). Compared
with pure diagonal preconditioning, the number of matrix-vector multiplications
required decreases from 6,683 to 953 for the matrix Episym2 from
electron/phonon interaction. Note that the Lanczos algorithm used in the
application code requires about double the number of matrix-vector multiplications
as QMR preconditioned JD for this problem. For the matrix Struct3 from
structural mechanics, the diagonally preconditioned method did not converge in
100 minutes. Note that in this case the adaptive approach is markedly superior
to preconditioning with a fixed number of 10 QMR iterations; the number of
matrix-vector multiplications decreases from 55,422 to 11,743.

Fig. 4. Different preconditioners. Real symmetric matrices Episym2 and Struct3.
NEC Cenju-3, 64 processors


6.3  JD  versus  Lanczos  Method 


In Table 2, the sequential execution times on an SGI O2 workstation (128 MHz,
128 Mbytes main memory) of a common implementation of the symmetric Lanczos
algorithm [6] and adaptively QMR preconditioned JD are compared for
computing the four smallest eigenvalues and -vectors. In both cases, the matrices are
stored in CRS format. The required accuracy of the results was set close to
machine precision. Except for matrix CulWil (the Lanczos algorithm did not
converge within 120 minutes since the smallest eigenvalues of the matrix are
very close to each other), both methods gave the same eigenvalues and -vectors
within the required accuracy. Since the Lanczos method applied stores all
Lanczos vectors to compute the eigenvectors of A, only the smallest matrices from
Table 1 could be used for the comparison.

Table  2  shows  that  the  JD  method  is  markedly  superior  to  the  Lanczos  algo¬ 
rithm  for  the  problems  tested.  Moreover,  the  results  for  the  matrices  Laplace 
and  CulWil —  some  of  the  smallest  eigenvalues  are  very  close  to  each  other  — 
appear  to  indicate  that  JD  can  handle  the  problem  of  close  eigenvalues  much 
better  than  the  Lanczos  algorithm. 
Table 2. Comparison of JD and the symmetric Lanczos algorithm. Sequential
execution times. SGI O2 workstation

Matrix     Lanczos     JD         Ratio
Struct1    79.3 s      34.4 s     2.3
Struct2    5633.4 s    899.7 s    6.3
Laplace    97.4 s      2.7 s      36.1
GregCar    95.9 s      15.7 s     6.1
CulWil     -           16.1 s     -
RNet       197.9 s     41.8 s     4.7

6.4  Parallel  Performance 

In  all  following  investigations,  the  JD  iteration  was  stopped  if  the  residual  norms 
divided  by  the  initial  norms  are  less  than  10~^  for  the  Cray  T3E  results). 

Fig. 5 shows the scaling of QMR preconditioned JD for computing the four
smallest eigenpairs of the large real symmetric electron/phonon interaction
matrices Episym1, Episym3, and Episym4 on NEC Cenju-3. On 128 processors,
speedups of 45.5, 105.7, and 122.3 are achieved for the matrices with increasing
order; the corresponding execution times are 8.3 s, 23.3 s, and 168.3 s.
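To make the scaling figures easier to compare, the quoted speedups can be converted into parallel efficiencies (speedup divided by the number of processors). The short sketch below does this arithmetic for the three speedups reported on 128 processors; the function name is an assumption of this example.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup that is actually achieved."""
    return speedup / processors


# Speedups on 128 NEC Cenju-3 processors, listed by increasing matrix order.
speedups = [45.5, 105.7, 122.3]
efficiencies = [parallel_efficiency(s, 128) for s in speedups]
```

The efficiencies are roughly 36%, 83%, and 96%, so the larger the matrix order, the closer the solver comes to ideal scaling.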


Fig. 5. Speedups. Real symmetric matrices Episym1, Episym3, and Episym4,
electron/phonon interaction. NEC Cenju-3


In Fig. 6, speedups of QMR preconditioned JD for computing the four smallest
eigenpairs of the two large real symmetric electron/phonon interaction matrices
Episym4 and Episym5 on Cray T3E are displayed. The problems Episym4
and Episym5 of order 1,009,008 and 5,513,508 result in execution times of 15.2 s
and 129.2 s, respectively, on 512 processors. The largest real symmetric
electron/phonon interaction problem computed, Episym6 of order 11,639,628, has
an execution time of 278.7 s on 512 processors.

Fig. 6. Speedups. Real symmetric matrices Episym4 and Episym5, electron/phonon
interaction. Cray T3E

The  effect  of  coupling  the  communication  for  the  two  independent  matrix- 
vector  multiplications  per  complex  Hermitian  QMR  iteration  (see  the  framed 
operations  in  Algorithm  1)  is  displayed  in  Fig.  7  for  computing  the  four  smallest 
eigenpairs  of  the  dense  complex  Hermitian  matrix  Thinfilms.  This  matrix  is 
chosen  since  the  problem  is  of  medium  size  and  the  matrix-vector  operations 
require  communication  with  all  non-local  processors.  In  Fig.  7,  the  execution 
times  on  NEC  Cenju-3  of  JD  with  and  without  coupling  divided  by  the  total 
number  of  matrix-vector  products  (MVPs)  are  compared. 

Communication coupling halves the number of messages and doubles the
message length. By this, the overhead of communication latencies is markedly
reduced, and possibly a higher transfer rate can be reached. For the matrix
Thinfilms, coupling gives a gain of 5% to 15% of the total time. For much
larger matrices, gains are usually very slight: if the message lengths are large,
latency is almost negligible and higher transfer rates cannot be reached. For the
matrix Episym2, e.g., corresponding timings on 128 processors give 64.5 ms
without coupling and 64.4 ms with coupling.
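Why coupling helps for short messages but hardly at all for long ones can be seen from a simple linear cost model, time = latency + length / bandwidth per message. The sketch below compares two separate messages with one coupled message of double length. The model, the latency of 50 microseconds, and the bandwidth of 40 Mbytes/s are illustrative assumptions, not measurements from the NEC Cenju-3 or the Cray T3E.

```python
def transfer_time(n_messages, bytes_per_message, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: each message pays one latency plus its transfer time."""
    return n_messages * (latency_s + bytes_per_message / bandwidth_bytes_per_s)


def coupling_gain(message_bytes, latency_s=50e-6, bandwidth=40e6):
    """Relative time saved by sending one message of double length
    instead of two separate messages (hypothetical machine parameters)."""
    uncoupled = transfer_time(2, message_bytes, latency_s, bandwidth)
    coupled = transfer_time(1, 2 * message_bytes, latency_s, bandwidth)
    return (uncoupled - coupled) / uncoupled


gain_short = coupling_gain(800)          # short messages: latency dominates
gain_long = coupling_gain(8_000_000)     # long messages: bandwidth dominates
```

In this model the absolute saving is always exactly one latency, so its share of the total time shrinks as the messages grow, in line with the behavior reported for Thinfilms versus Episym2.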

Fig. 7. Communication coupling. Complex Hermitian matrix Thinfilms. NEC Cenju-3

Fig. 8 shows the scaling of the complex Hermitian version of QMR
preconditioned JD for computing the four smallest eigenpairs of the two large complex
Hermitian electron/phonon interaction matrices Epiherm1 and Epiherm2 on
NEC Cenju-3. On 128 processors, speedups of 57.5 (execution time 38.8 s) and
122.5 (execution time 320.2 s) are achieved for the matrices Epiherm1 and
Epiherm2 of order 126,126 and 1,009,008, respectively.

Fig. 8. Speedups. Complex Hermitian matrices Epiherm1 and Epiherm2,
electron/phonon interaction. NEC Cenju-3

7  CONCLUSIONS 

Using real symmetric and complex Hermitian matrices from applications, the
efficiency of the developed parallel JD methods was demonstrated on massively
parallel systems. The data distribution strategy applied supports computational
load balance for both irregular matrix-vector and regular vector-vector
operations in iterative solvers. The investigated communication scheme for
matrix-vector multiplications together with a block rearranging of the sparse matrix
data makes possible the overlapped execution of computations and data
transfers. Moreover, parallel adaptive iterative preconditioning with QMR was shown
to accelerate JD convergence markedly. Coupling the communications for the two
independent matrix-vector products in the complex Hermitian QMR iteration
halves the number of required messages and results in additional execution time
gains for small and medium size problems. Furthermore, a sequential comparison
of QMR preconditioned JD and the symmetric Lanczos algorithm indicates
a superior convergence and time behavior in favor of JD.
Abstract. New parallel algorithms for the evaluation of series of orthogonal
polynomials are presented. The performance of these algorithms on
a message passing distributed memory computer (a Cray T3D) is compared.


1  Introduction 

The  evaluation  of  polynomials  is  one  of  the  most  common  problems  in  scientific 
computing.  Therefore,  it  has  been  extensively  studied  and  several  algorithms 
suitable  for  parallel  evaluation  have  been  proposed  [6, 9-11].  All  these  algorithms 
focus  their  attention  on  the  evaluation  of  power  series. 

In  several  scientific  applications  polynomials  do  not  appear  as  power  series, 
but  are  written  using  orthogonal  polynomials  [1]  due  to  their  special  features.  A 
parallel  algorithm  was  presented  in  [2, 3]  for  the  evaluation  of  Chebyshev  series. 

In  this  paper  we  present  two  new  algorithms  for  the  evaluation  of  polynomials 
written  as  finite  series  of  orthogonal  polynomials.  These  algorithms  are  based  on 
the  matrix  formulation  of  the  sequential  algorithms  and  afterwards  they  apply 
some  techniques  used  in  the  parallel  solution  of  tridiagonal  [14]  and  banded 
linear  systems  [5,8,12].  These  algorithms  are  given  in  Section  3  and  compared 
in  Section  4  on  a  Cray  T3D. 

A sequence of orthogonal polynomials $\{\phi_r(x)\}$ always satisfies [1] the triple
recurrence relation:

$\phi_r(x) - \alpha_r(x)\,\phi_{r-1}(x) - \beta_r\,\phi_{r-2}(x) = 0, \quad r \ge 2, \qquad (1)$

for some functions $\alpha_r(x)$ and $\beta_r$. Sequential algorithms for evaluation of finite
series based on orthogonal polynomials exist and are extensively used, such as
the Clenshaw [4] or Forsythe [7] algorithms.

The Forsythe algorithm [7] is based on a direct application of the three-term
recurrence formula (1) and consists of:

$\sum_{r=0}^{n} c_r\,\phi_r(x) = f_n(x), \qquad (2)$

where, given $\phi_0(x)$, $\phi_1(x)$ and $f_1(x) = c_0\,\phi_0(x) + c_1\,\phi_1(x)$,

$\phi_r(x) = \alpha_r(x)\,\phi_{r-1}(x) + \beta_r\,\phi_{r-2}(x),$
$f_r(x) = f_{r-1}(x) + c_r\,\phi_r(x), \qquad r = 2, \ldots, n. \qquad (3)$
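A minimal sequential sketch of the Forsythe scheme is shown below. The recurrence coefficients and the two starting polynomials are passed in as callables; this interface and all names are assumptions made for the example. The usage lines apply it to a short Legendre series, with $\alpha_r(x) = \frac{2r-1}{r}x$ and $\beta_r = -\frac{r-1}{r}$.

```python
def forsythe(c, x, alpha, beta, phi0, phi1):
    """Evaluate sum_{r=0}^{n} c[r] * phi_r(x) by forward recurrence.

    alpha(r, x) and beta(r) are the coefficients of
    phi_r = alpha_r(x) * phi_{r-1} + beta_r * phi_{r-2}.
    """
    n = len(c) - 1
    if n == 0:
        return c[0] * phi0(x)
    p_prev, p_cur = phi0(x), phi1(x)
    f = c[0] * p_prev + c[1] * p_cur
    for r in range(2, n + 1):
        # Advance the recurrence, then accumulate the next term.
        p_prev, p_cur = p_cur, alpha(r, x) * p_cur + beta(r) * p_prev
        f += c[r] * p_cur
    return f


# Legendre series 1*P_0 + 2*P_1 + 3*P_2 evaluated at x = 0.5.
value = forsythe(
    [1.0, 2.0, 3.0], 0.5,
    alpha=lambda r, x: (2 * r - 1) / r * x,
    beta=lambda r: -(r - 1) / r,
    phi0=lambda x: 1.0,
    phi1=lambda x: x,
)
```

Since $P_2(0.5) = -0.125$, the series value is $1 + 1 - 0.375 = 1.625$.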

A further computational algorithm is the Clenshaw algorithm [4,13], which
permits evaluation of a finite series of orthogonal polynomials by means of the
expression:

$\sum_{r=0}^{n} c_r\,\phi_r(x) = \{c_0 + \beta_2\, q_2(x)\}\,\phi_0(x) + q_1(x)\,\phi_1(x) \qquad (4)$

where

$q_{n+1}(x) = q_{n+2}(x) = 0,$
$q_r(x) = c_r + \alpha_{r+1}(x)\, q_{r+1}(x) + \beta_{r+2}\, q_{r+2}(x), \quad \text{for } r = n, \ldots, 1. \qquad (5)$
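The backward recurrence of the Clenshaw scheme can be sketched in a few lines. As with any such sketch, the callable interface and the names are assumptions of this example. The usage lines evaluate a Chebyshev series of the first kind, for which $\alpha_r(x) = 2x$ and $\beta_r = -1$.

```python
def clenshaw(c, x, alpha, beta, phi0, phi1):
    """Evaluate sum_{r=0}^{n} c[r] * phi_r(x) by backward recurrence.

    q_r = c_r + alpha_{r+1}(x) q_{r+1} + beta_{r+2} q_{r+2},  r = n, ..., 1,
    result = (c_0 + beta_2 q_2) phi_0(x) + q_1 phi_1(x).
    """
    n = len(c) - 1
    q1 = q2 = 0.0                      # hold q_{r+1} and q_{r+2}
    for r in range(n, 0, -1):
        q1, q2 = c[r] + alpha(r + 1, x) * q1 + beta(r + 2) * q2, q1
    # After the loop q1 = q_1 and q2 = q_2.
    return (c[0] + beta(2) * q2) * phi0(x) + q1 * phi1(x)


# Chebyshev series 1*T_0 + 2*T_1 + 3*T_2 + 4*T_3 evaluated at x = 0.3.
value = clenshaw(
    [1.0, 2.0, 3.0, 4.0], 0.3,
    alpha=lambda r, x: 2 * x,
    beta=lambda r: -1.0,
    phi0=lambda x: 1.0,
    phi1=lambda x: x,
)
```

With $T_2(0.3) = -0.82$ and $T_3(0.3) = -0.792$ the direct sum is $1 + 0.6 - 2.46 - 3.168 = -4.028$, which the backward recurrence reproduces.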


2  Parallel  Algorithm  to  Evaluate  Finite  Series  of 
Chebyshev  Polynomials 

The parallel algorithms to evaluate Chebyshev series in [2,3] are based on the
product rules for the first ($T_i(x)$) and the second kind ($U_i(x)$) Chebyshev
polynomials:

$T_{m+p}(x) = 2\,T_p(x)\,T_m(x) - T_{m-p}(x),$
$U_{m+p}(x) = 2\,T_p(x)\,U_m(x) - U_{m-p}(x), \qquad \text{for } m \ge p. \qquad (6)$

The parallel Forsythe algorithm to evaluate $p_T(x) = \sum_{r=0}^{n} c_r T_r(x)$ or
$p_U(x) = \sum_{r=0}^{n} c_r U_r(x)$ can be written, with $n = kp - 1$, as the following routine [2]:

STEP I: Processor m (m = 0, ..., p - 1)
    t_0 = 1
    if (p_T(x)) then
        t_1 = x
    else if (p_U(x)) then
        t_1 = 2x
    end if
    for i = 2, 2p - 1
        t_i = 2 x t_{i-1} - t_{i-2}
    end

STEP II: Processor m (m = 0, ..., p - 1)
    f_m = c_m t_m + c_{p+m} t_{p+m}
    for i = 2, k - 1
        t_{ip+m} = 2 T_p(x) t_{(i-1)p+m} - t_{(i-2)p+m}
        f_m = f_m + c_{ip+m} t_{ip+m}
    end

STEP III: Processor m (m = 0, ..., p - 1)
    red_sum_0(f_m, sum)

In the algorithm, the number p is the number of processors and the function
red_sum_0(f_m, sum) stands for a global reduction operation, in this case the
addition of f_m for m = 0, ..., p - 1, and writes the result in the variable sum at
the processor 0. The variable sum gives as output the value of the polynomial.
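The three steps can be simulated sequentially for a Chebyshev series of the first kind: a loop over m stands in for the processors, and a plain sum stands in for the global reduction. This is a sketch under those assumptions, with a hypothetical function name; it is not the message passing code used on the Cray T3D. It requires that the number of coefficients is a multiple of p and at least 2p.

```python
def chebyshev_series_parallel(c, x, p):
    """Evaluate sum_r c[r] * T_r(x) with the strided scheme of STEP I-III.

    'Processor' m handles the terms r = m, p + m, 2p + m, ...
    """
    n_terms = len(c)
    if n_terms % p != 0 or n_terms < 2 * p:
        raise ValueError("need len(c) = k*p with k >= 2")
    k = n_terms // p

    # STEP I: every processor computes T_0 .. T_{2p-1} by the usual recurrence.
    t = [1.0, x]
    for i in range(2, 2 * p):
        t.append(2 * x * t[i - 1] - t[i - 2])
    t_p = t[p]

    # STEP II: strided recurrence T_{ip+m} = 2 T_p T_{(i-1)p+m} - T_{(i-2)p+m}.
    partial = []
    for m in range(p):                     # one pass per simulated processor
        older, newer = t[m], t[p + m]
        f_m = c[m] * older + c[p + m] * newer
        for i in range(2, k):
            current = 2 * t_p * newer - older
            f_m += c[i * p + m] * current
            older, newer = newer, current
        partial.append(f_m)

    # STEP III: global reduction (here simply a sum).
    return sum(partial)


coeffs = [float(r + 1) for r in range(12)]
value = chebyshev_series_parallel(coeffs, 0.3, p=3)
```

Each simulated processor touches only every p-th coefficient, so the work in STEP II is split evenly and the only communication is the final reduction.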


3  Parallel  Algorithms  to  Evaluate  Finite  Series  of 
General  Orthogonal  Polynomials 

The  algorithm  to  evaluate  Chebyshev  series  is  based  on  the  product  rules  for 
the  Chebyshev  polynomials.  These  rules  are  very  simple  in  this  special  case  and 
permit  us  to  obtain  an  efficient  parallel  algorithm.  Unfortunately,  other  kinds 
of  orthogonal  polynomials  satisfy  very  cumbersome  product  rules.  Therefore,  in 
these  cases  we  must  obtain  parallel  algorithms  in  a  different  way. 

First  of  all,  we  may  remark  that  for  general  families  of  orthogonal  polynomials 
the  coefficients  in  the  recurrence  relation  (1)  may  be  calculated  at  the  same  time 
we  evaluate  the  finite  series.  This  process  may  be  done  in  parallel. 

In Table 1 we show (Abramowitz et al. [1]) the coefficients $\alpha_r(x)$ and $\beta_r$ in
the case of the Jacobi polynomials $P_i^{(\alpha,\beta)}(x)$, Gegenbauer polynomials $C_i^{\lambda}(x)$,
Legendre polynomials $P_i(x)$ and Chebyshev polynomials of the first $T_i(x)$ and
second kind $U_i(x)$.


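Table 1. Coefficients of the three-term recurrence relation for some families of
orthogonal polynomials.

$P_r^{(\alpha,\beta)}(x)$:
    $\alpha_r(x) = \frac{(2r+\alpha+\beta)(2r+\alpha+\beta-1)}{2r(r+\alpha+\beta)}\,x + \frac{(\alpha^2-\beta^2)(2r+\alpha+\beta-1)}{2r(r+\alpha+\beta)(2r+\alpha+\beta-2)}$
    $\beta_r = -\frac{(r+\alpha-1)(r+\beta-1)(2r+\alpha+\beta)}{r(r+\alpha+\beta)(2r+\alpha+\beta-2)}$

$C_r^{\lambda}(x)$:
    $\alpha_r(x) = 2x\,\frac{r-1+\lambda}{r}$
    $\beta_r = -\frac{r-2+2\lambda}{r}$

$P_r(x)$:
    $\alpha_r(x) = \frac{2r-1}{r}\,x$
    $\beta_r = -\frac{r-1}{r}$

$T_r(x)$:
    $\alpha_r(x) = 2x$
    $\beta_r = -1$

$U_r(x)$:
    $\alpha_r(x) = 2x$
    $\beta_r = -1$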
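As a small illustration of how these coefficients could be evaluated alongside the series, the sketch below codes the standard three-term recurrence coefficients for each family. The function names are hypothetical, and the recurrence is assumed to have the form $\varphi_r = \alpha_r(x)\varphi_{r-1} + \beta_r\varphi_{r-2}$; the Jacobi expression is undefined when $2r + \alpha + \beta = 2$.

```python
def jacobi_coeffs(r, x, a, b):
    """(alpha_r(x), beta_r) for Jacobi polynomials; needs 2r + a + b != 2."""
    s = 2 * r + a + b
    d = 2 * r * (r + a + b)
    alpha = s * (s - 1) * x / d + (a * a - b * b) * (s - 1) / (d * (s - 2))
    beta = -(r + a - 1) * (r + b - 1) * s / (r * (r + a + b) * (s - 2))
    return alpha, beta


def gegenbauer_coeffs(r, x, lam):
    """(alpha_r(x), beta_r) for Gegenbauer polynomials with parameter lam."""
    return 2 * x * (r - 1 + lam) / r, -(r - 2 + 2 * lam) / r


def legendre_coeffs(r, x):
    """(alpha_r(x), beta_r) for Legendre polynomials."""
    return (2 * r - 1) * x / r, -(r - 1) / r


def chebyshev_coeffs(r, x):
    """(alpha_r(x), beta_r) for Chebyshev polynomials of either kind, r >= 2."""
    return 2 * x, -1.0
```

Two consistency checks follow directly from the formulas: Jacobi with $\alpha = \beta = 0$ and Gegenbauer with $\lambda = 1/2$ both reduce to the Legendre coefficients, and Gegenbauer with $\lambda = 1$ reduces to the Chebyshev ones.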
3.1 Parallel Clenshaw's Algorithm

The Clenshaw algorithm (Eqs. (4), (5)) can be formulated using matrix notation.
Let $C$ be the matrix (7); then the Clenshaw algorithm is equivalent to solving
the banded upper triangular linear system $C q = c$, where $q$ and $c$ are the
vectors $q^T = (q_1, q_2, \ldots, q_n)$ and $c^T = (c_1, c_2, \ldots, c_n)$, and
afterwards using relation (4) to obtain the value of the series.
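The matrix formulation can be illustrated by a short sequential sketch. It assumes, since the matrix itself is not reproduced here, that the polynomials satisfy $\varphi_r = \alpha_r(x)\varphi_{r-1} + \beta_r\varphi_{r-2}$, so that the system has ones on the diagonal, $-\alpha_{r+1}$ on the first superdiagonal and $-\beta_{r+2}$ on the second. Back substitution then gives $q_r = c_r + \alpha_{r+1} q_{r+1} + \beta_{r+2} q_{r+2}$, and the series value is $(c_0 + \beta_2 q_2)\varphi_0(x) + q_1\varphi_1(x)$. The function name and argument layout are hypothetical.

```python
def clenshaw(c, alpha, beta, phi0, phi1):
    """Evaluate sum_{r=0}^{n} c[r] * phi_r by back substitution.

    alpha[r] and beta[r] hold the recurrence coefficients for r = 2..n;
    entries with index below 2 are never read.
    """
    n = len(c) - 1
    q = [0.0] * (n + 3)  # q[n+1] = q[n+2] = 0 close the recurrence
    for r in range(n, 0, -1):
        a = alpha[r + 1] if r + 1 <= n else 0.0
        b = beta[r + 2] if r + 2 <= n else 0.0
        # row r of the upper triangular system, solved for q[r]
        q[r] = c[r] + a * q[r + 1] + b * q[r + 2]
    b2 = beta[2] if n >= 2 else 0.0
    return (c[0] + b2 * q[2]) * phi0 + q[1] * phi1
```

Only $q_1$ and $q_2$ enter the final expression, which is why a parallel variant needs to recover just these two unknowns.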

To simplify the notation we suppose that $n = kp$, where $p$ is the number of
processors. To illustrate the algorithm we present the case $n = 12$ and $p = 3$;
then the matrix $C$ (7), written as a block matrix, will be


Now, we may use Gaussian elimination in parallel to diagonalize each of the
diagonal submatrices, that is, we apply the divide and conquer algorithm
(Wang [14]), and we obtain the system $C^R q = c^R$, where
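Thus, our problem is to solve a reduced linear system of order $2p$ (in this
case 6)

$$
\begin{pmatrix}
1 & 0 & a_0^1 & b_0^1 & 0 & 0 \\
0 & 1 & a_0^2 & b_0^2 & 0 & 0 \\
0 & 0 & 1 & 0 & a_1^1 & b_1^1 \\
0 & 0 & 0 & 1 & a_1^2 & b_1^2 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} q_1 \\ q_2 \\ q_5 \\ q_6 \\ q_9 \\ q_{10} \end{pmatrix}
=
\begin{pmatrix} c_1^R \\ c_2^R \\ c_5^R \\ c_6^R \\ c_9^R \\ c_{10}^R \end{pmatrix}
\qquad (9)
$$

And finally, with the values of $q_1$ and $q_2$, we can evaluate the polynomial
using (4).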

Therefore,  the  complete  algorithm  consists  of: 

STEP I: Processor m (m = 0, ..., p - 1)
    for i = m k, (m + 1) k - 1
        evaluate(\alpha_{i+2}, \beta_{i+3})
    end

STEP  II:  Processor  p  -  1 

=  c„ 

—  Cn—\  T  (Xn  C 
=  C„_2  +  0n 

for  i  =  1 ,  fc  -  3 


+  On-i 

=  Cn-i-2  +  Pn-i 

end 

^■{p-l)k+2 

^(l-l)k+l  =  +  Ctn-k+'I  C 

4-1  =  0 

S-i  ^ 

4-1  =  0 
4-1  =  0 

Processor  m  ^  p  —  I 

=  -a(^rn+l)k+l 
=  -/3(m+l)fc+l  +Q:(rn+l)fcO^ 
U*  —  P{m+\)k  ^ 

—  ~P{m+Vik+2 
(771+  l)k 
+  l)k  ' 

C  ™  ^{7n+l)k 


^  —  ^(77t,+  1)A:^ 

r.1  — 


=  C(m  +  l)k-l  +  <a'(m+l)fc 

=  C{m  +  l)k-2  +  0{m  +  l)kC^ 

for  i  =  1 .  A;  -  3 
—  Cl  — i  ^ 

Cl^  —  P{m+l)k-i  O 
6^  =  6^ 

b'  —  b  -F  (y^Yn+i)k—i  b 

b  —  P(m  +  \)k  —  ib 

+  Q(m+I)fc_i 

~  C(,n+l)fc  — t  — 2  “I"  P{rn  +  \)k  —  i  ^ 

end 

^  +  amk+2 

6^-2  =  62 

6,'^„  1=5^  +  amfc+2 
Cmfc  +  2  =  c2 
^mk+1  =  +  Oimk+2 

STEP III: Processor m (m = 0, ..., p - 1)

communicate_0(aj^'2_ 1 ^ ]^k.-2 ^ f^k- 1 ^ ^ )

STEP IV: Processor 0

    q_3 = c_{(p-1)k+1}^R
    q_4 = c_{(p-1)k+2}^R

    for m = p - 2, 0 step -1
        q_1 = c_{mk+1}^R - a_m^1 q_3 - b_m^1 q_4
        q_2 = c_{mk+2}^R - a_m^2 q_3 - b_m^2 q_4
        q_3 = q_1
        q_4 = q_2
    end

STEP V: Processor 0

    sum = (c_0 + \beta_2 q_2) \varphi_0(x) + q_1 \varphi_1(x)


Here the function evaluate($\alpha_i$, $\beta_i$) evaluates the coefficients
$(\alpha_i, \beta_i)$, and communicate_i(var) communicates the variables var to
processor i.

In this algorithm we only need the value of the polynomial, that is, we only
need the terms $q_1$ and $q_2$ of the solution of the linear system $C q = c$.
For this reason we only have one communication process.

In the complexity analysis of the algorithm we suppose that the evaluation
of the coefficients $\alpha_i$ and $\beta_i$ has computational complexity
$T_\alpha$ and $T_\beta$, respectively, and that $T_{com}$ is the complexity of
each communication process. Thus, a simple analysis of the algorithm gives us
its computational complexity.

Proposition 1. The theoretical computational complexity of the parallel
Clenshaw algorithm is

$$ T_p = \left\lceil \frac{n}{p} \right\rceil (10 + T_\alpha + T_\beta) + 8p - 19 + T_{com}. $$
3.2 Parallel Forsythe's Algorithm

In the Forsythe algorithm (Eqs. (2), (3)), the evaluation of the orthogonal
polynomials $\{\varphi_i\}$ is equivalent to solving the linear system
$F \varphi = e_{n+1}$, where $F$ is the matrix (10).

Afterwards we only need to perform the sum $\sum_{r=0}^{n} c_r \varphi_r$.
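For comparison with the parallel version, the sequential form of this idea can be sketched as a forward recurrence followed by the weighted sum. The sketch assumes the recurrence $\varphi_r = \alpha_r(x)\varphi_{r-1} + \beta_r\varphi_{r-2}$ and uses a hypothetical function name; it accumulates the sum while the polynomials are generated, so no array of $\varphi_r$ values is stored.

```python
def forsythe(c, alpha, beta, phi0, phi1):
    """Evaluate sum_{r=0}^{n} c[r] * phi_r with the forward recurrence.

    alpha[r] and beta[r] hold the recurrence coefficients for r = 2..n;
    entries with index below 2 are never read.
    """
    n = len(c) - 1
    total = c[0] * phi0
    if n == 0:
        return total
    total += c[1] * phi1
    prev2, prev1 = phi0, phi1
    for r in range(2, n + 1):
        # forward substitution: one new polynomial value per step
        cur = alpha[r] * prev1 + beta[r] * prev2
        total += c[r] * cur
        prev2, prev1 = prev1, cur
    return total
```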

To simplify the notation we suppose that $n = kp - 1$, where $p$ is the number
of processors. To illustrate the algorithm we present the case $n = 11$ and
$p = 3$; then the matrix $F$ (10) will be


As before, by using Gaussian elimination and the divide and conquer algorithm
we obtain the system $F^R \varphi = c^R$, where we write $F^R$ using the same
notation as (8) and

$$ \varphi^T = (\varphi_{11}, \varphi_{10}, \ldots, \varphi_0), \qquad (c^R)^T = (0, 0, \ldots, 0, c_3^R, c_2^R, c_1^R, c_0^R). \qquad (11) $$

Our  problem  is  reduced  to  solving  the  linear  system  of  order  6, 


Then we communicate the solutions to each processor in order to obtain the
values of all the orthogonal polynomials $\{\varphi_i(x)\}$ by solving the
different subsystems. Finally, we obtain the value of the polynomial by adding
each partial sum.
Thus, the complete algorithm consists of:

STEP I: Processor m (m = 0, ..., p - 1)
    for i = (p - m - 1) k, (p - m) k - 1
        evaluate(\alpha_i, \beta_i)
    end

STEP II: Processor p - 1
    c_0^R = 1
    c_1^R = \alpha_1
    c_2^R = \beta_2
    for i = 1, k - 3
        c_{i+1}^R = c_{i+1}^R + \alpha_{i+1} c_i^R
        c_{i+2}^R = \beta_{i+2} c_i^R
    end
    c_{k-1}^R = c_{k-1}^R + \alpha_{k-1} c_{k-2}^R

«p:}=o 

ip-i=0 

Processor  m  ^  p  -  1 

~  ~Ot{p-m~\.)k  ^ 

~  ~0(p-m—l}k+l  3-  0!(^p^rn-l)k+l 
~  0(p-m-l)k+2 
~  ~ P{p—m  —  l)k 
~  ^{p-m~l)k+l 
“  P(p-m-\)k+2^m 

for  i  =  1 ,  A:  -  3 

~  "k  (^(p-m.-l)k+i+l 

~  0ip-m-l)k+i+2 

T  ^{p-m-l)k+i+l 

(3{p—m  —  l)k-\-i+2^m 

end 

^tn  ~  T  ^{p~m)k  —  l 

~  T  ^(p~m)k-l  ^rn 

pR  _  0 

^(p-m)k-2  -  ^ 

=0 

STEP  III:  Processor  m  (m  =  0, . . .  ,p  -  1) 

communicate_0(a^-2,  cf,_,„);._  j) 

STEP  IV:  Processor  0 

(pk-2  =  cf_2 
<Pk-l  =  cf_i 

for  j  —  2,  p 

<Pjk-2  =  -^pZ]  <i)(j-i)k-i  -  bpZj  (p(j-i)k-2 

(pjk-1  =  -«pl]  0(j-l)fe-l  -  b^z]  <(>(j-l)k-2 

end 
\varphi_{-2} = 0
\varphi_{-1} = 0

STEP  V:  Processor  0 

communicate_m(  0(p  _  m  - 1 )  A:  -  2  >  -  »n  - 1 )  fc  - 1 ) 

STEP  VI:  Processor  p  -  1 

Sp-i  =  0 

for  i  =  0,  fc  —  1 

07  =  cf 

Sp—i  —  Sp— 1  +  Ci  (j)i 

end 

Processor  m  5^  p  -  1 

Sm  —  0 

for  i  =  0,  A:  -  1  , 

4‘{p-m-))k+i  =  -</'(p-m-l)fe-l  “m  "  0(p-Tn- l)fe-2 
Sm  —  Sm  +  C(p_nx-l)fc+i  0(p-m-l)/c+i 

end 

STEP  VII:  Processor  m  (m  =  0, . . .  ,p  -  1) 

red_sum_0(s_m, sum)

As before, a simple analysis of the algorithm gives us its computational
complexity.

Proposition 2. The theoretical computational complexity of the parallel Forsythe
algorithm is

$$ T_p = \left\lceil \frac{n}{p} \right\rceil (11 + T_\alpha + T_\beta) + 6p - 15 + 3\,T_{com}. $$

This algorithm has a greater computational complexity than the parallel
Clenshaw algorithm, and it also requires more global communication processes.
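The two cost estimates can be compared numerically with a small helper. This is a sketch under stated assumptions: the leading factor in both propositions is read as $\lceil n/p \rceil$, the Forsythe communication term is read as $3\,T_{com}$ (one per communication step), and the function names are hypothetical.

```python
def clenshaw_parallel_cost(n, p, t_alpha, t_beta, t_com):
    """Cost model of Proposition 1, assuming the leading factor is ceil(n/p)."""
    blocks = -(-n // p)  # integer ceiling of n / p
    return blocks * (10 + t_alpha + t_beta) + 8 * p - 19 + t_com


def forsythe_parallel_cost(n, p, t_alpha, t_beta, t_com):
    """Cost model of Proposition 2, assuming the leading factor is ceil(n/p)."""
    blocks = -(-n // p)  # integer ceiling of n / p
    return blocks * (11 + t_alpha + t_beta) + 6 * p - 15 + 3 * t_com
```

For the illustrative cases used in the text (n = 12, p = 3 for Clenshaw and n = 11, p = 3 for Forsythe) with unit communication cost and free coefficients, these models give 46 and 50 operations respectively.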


4  Numerical  Tests 

The algorithms presented here have been tested on a Cray T3D at the Edinburgh
Parallel Computing Centre (EPCC), using up to 128 processing elements (PEs)
with the Message Passing Interface (MPI) as the parallel environment. This
computer is hosted by a Cray Y-MP system. Each T3D PE consists of a DECchip
21064 Alpha processor with 64 Mbytes of memory.

In Figure 1 we present the speed-up ($S_p = T_1/T_p$, where $T_1$ is the
evaluation time using the sequential Clenshaw algorithm) for the parallel
Forsythe algorithm to evaluate finite series of Chebyshev polynomials of the
first kind.

In Figures 2 and 3 we show the speed-up for the parallel Clenshaw algorithm
to evaluate finite series of Jacobi, Gegenbauer and Legendre polynomials, and
in Figures 4 and 5 we show the same for the parallel Forsythe algorithm. The
performance of the parallel Clenshaw algorithm is better than that of the
parallel Forsythe algorithm. We observe that the time needed to evaluate the
coefficients $\alpha_i$ and $\beta_i$ of the three-term recurrence permits us to
obtain good speed-up results, taking into account that the parallel algorithms
to evaluate orthogonal series have a higher computational complexity than the
sequential ones. In the case of Chebyshev series the parallel algorithm has the
same complexity as the sequential one, but in this case we do not need to
evaluate the coefficients.

In order to see the influence of the communication times in the parallel
algorithms, we present in Figures 6, 7 and 8 the speed-up without the
communication process. For low degree polynomials we observe that the
communication process takes more time than the evaluation when we have a high
number of processors.


Fig. 1. Speed-up in the parallel evaluation on a CRAY T3D of several Chebyshev
polynomial series using the parallel Forsythe algorithm (horizontal axis: degree).
Fig. 2. Speed-up in the parallel evaluation on a CRAY T3D of several polynomials
using the parallel Clenshaw algorithm (legend: -o- Jacobi, -x- Gegenbauer,
-A- Legendre).


Fig. 3. Speed-up, depending on the number of processors, in the parallel evaluation
on  a  CRAY  T3D  of  several  polynomials  using  the  parallel  Clenshaw  algorithm  (Jacobi 
polynomial  series  -o-,  Gegenbauer  polynomial  series  -x-  and  Legendre  polynomial 
series  -A-). 
Fig. 4. Speed-up in the parallel evaluation on a CRAY T3D of several polynomials
using the parallel Forsythe algorithm (legend: -o- Jacobi, -x- Gegenbauer,
-A- Legendre).
Fig. 5. Speed-up, depending on the number of processors, in the parallel evaluation
on a CRAY T3D of several polynomials using the parallel Forsythe algorithm (Jacobi
polynomial series -o-, Gegenbauer polynomial series -x- and Legendre polynomial
series -A-).
Fig. 6. Speed-up, without the communication time, in the parallel evaluation on a
CRAY T3D of several Chebyshev polynomial series.


Fig.  7.  Speed-up,  without  the  communication  time,  in  the  parallel  evaluation  on  a 
CRAY  T3D  of  several  polynomials  using  the  parallel  Clenshaw  algorithm. 
Fig. 8. Speed-up, without the communication time, in the parallel evaluation on a
CRAY  T3D  of  several  polynomials  using  the  parallel  Forsythe  algorithm. 
Coarse-grain parallelization of a multi-block
Navier-Stokes  solver  on  a  shared  memory 
parallel  vector  computer 
Abstract. The coarse-grain, or block-loop, parallelization of the multi-block
Navier-Stokes flow solver ENSOLV on a NEC SX-4, a shared memory parallel
vector computer, is discussed. The performance of the parallel code was tested
by running it on ten benchmark cases provided by the ENSOLV user group. The
performance is measured in terms of speed-up, memory usage and execution
cost. The results of the benchmark cases are presented and compared to those
of the low-level DO-loop parallelization implemented earlier. The conclusion,
based on this comparison, is that for all benchmark cases except the single
block case, the block-loop parallelization gives better performance in terms
of speed-up. Although block-loop parallelization requires more memory, it
gives a lower overall execution cost.


1  Introduction 

The multi-block Navier-Stokes flow solver ENSOLV [2], [4] computes the solution
of the steady 3D Euler and/or thin-layer Navier-Stokes equations in an
arbitrary  flow  domain.  The  Euler  and  Navier-Stokes  equations  are  given  by  five 
partial  differential  equations  for  the  conservation  of  mass,  3D  momentum  and 
energy,  extended  by  the  perfect  gas  law.  To  solve  the  equations,  an  iterative 
procedure  which  resembles  time  integration  is  used.  A  number  of  techniques  are 
employed  to  accelerate  the  convergence: 

1.  A  multigrid  scheme,  which  performs  relaxations  on  different  grid  levels,  is 
used  as  solution  procedure.  This  accelerates  the  convergence  on  the  finest 
grid  level.  As  relaxation  procedure,  the  explicit  Runge-Kutta  time  stepping 
scheme  is  used; 

2.  The  evaluation  of  the  time  step,  needed  for  the  Runge-Kutta  scheme,  is 
performed  locally; 

3.  Implicit  residual  averaging  with  varying  coefficients  and  enthalpy  damping 
are  used. 

The  solver  is  based  on  multi-block  structured  grids.  Multigrid  is  applied 
around  multi-block,  i.e.  on  each  grid  level  a  loop  on  the  blocks  is  performed. 
The  Runge-Kutta  scheme  is  applied  on  a  block-by-block  basis.  This  means  that 
a  relaxation  of  all  blocks  consists  of  taking  one  complete  Runge-Kutta  time  step 
for each block successively, keeping the flow states in the other blocks fixed. The
flow solver ENSOLV is currently operational at NLR and in industry.

Within the NICE¹ program, ENSOLV is being parallelized in order to reduce
execution cost. Parallelization takes place on a 16-processor NEC SX-4 [9], a
shared memory parallel vector computer with a peak performance of 2 GFlop/s
per processor. In [5], ten representative benchmark cases were defined by the
ENSOLV user group; they constitute the benchmark for evaluating the
parallelized version of the ENSOLV code. The performance of the parallel code
is measured in terms of speed-up, memory usage and execution cost. At NLR,
execution cost is expressed in a single number, the so-called System Resource
Units (SRUs). The SRUs account for the sum of all CPU times, the amount of
memory used and the time the memory is occupied; the formula reflects the cost
price of the system elements [1]. Note that the sum of all CPU times is always
larger when parallelization is applied. If the parallelized code reduces the
real time by the same factor as it increases the memory usage, the SRUs should
stay constant. A detailed explanation of the SRU formula, as used for the
calculation of the SRUs reported in this document, can be found in [13].

The Data Parallelism strategy was chosen for parallelizing ENSOLV [8]. With
this strategy, parallelism is obtained by splitting up the DO-loops. Splitting up
the DO-loops is specifically suited for shared memory computers, such as the
shared memory parallel vector machine NEC SX-4 present at NLR.

There  are  different  levels  of  DO-loop  parallelization,  two  of  which  are: 

1. Low-level DO-loop parallelization: parallelization of DO-loops in individual
routines. A possible problem is the fine parallel grain size; the work per loop
might not be enough to overcome the parallel overhead. Also, the parallelization
has to be implemented on many loops in order to achieve an acceptable
parallelization percentage;

2. Block-loop parallelization: parallelization of the DO-loops over the blocks in
the domain. This can be considered as high-level DO-loop parallelization. It
results in the largest possible grain size. A possible problem is load
imbalance. The ENSOLV code uses a multigrid algorithm, which is implemented
around the multi-block algorithm. The operations of the multigrid algorithm
are relaxation, restriction and prolongation. The routines performing these
operations all contain block-loops. Therefore, this parallelization strategy is
applicable.

Earlier, ENSOLV was parallelized using the low-level DO-loop parallelization
strategy. This parallelization is described in [11]. It resulted in poor
performance in terms of speed-up and execution cost for most benchmark cases.
For benchmark cases with a relatively high number of multigrid levels, combined
with many small blocks in the grid, the poor performance was attributed to the
large parallel overhead caused by the very fine grain size. It was therefore
decided to implement block-loop parallelization. In Section 2, the block-loop
parallelization of ENSOLV is described briefly, together with the system into
which the resulting parallel code was integrated, along with tools for task
estimation, task allocation and speed-up estimation. In Section 3, the
benchmark cases are described and remarks are made about the expected
performance of the parallel code for these cases. In Section 4, the results of
testing the block-loop parallel ENSOLV code on the benchmark cases are
presented and discussed. In Section 5, the final conclusions are given.

¹ Netherlands Initiative for Computational Fluid Dynamics in Engineering with HPCN


2  The  Parallel  ENSOLV  System 

In  this  section,  the  block-loop  parallelization  of  ENSOLV  is  described  briefly. 
A  more  extensive  description  of  the  parallelization  can  be  found  in  [13].  The 
resulting  parallel  code  was  integrated  into  a  system  including  tools  for  task 
estimation,  task  allocation  and  speed-up  estimation. 


2.1 Block-loop parallelization of ENSOLV

Implementing block-loop parallelization, instead of low-level DO-loop
parallelization, has some consequences that need to be examined.

1.  To  eliminate  the  dependency  between  time  integration  in  the  blocks,  the 
Gauss-Seidel  algorithm  is  replaced  with  the  Jacobi  algorithm.  This  means 
that  when  updating  the  flow  state  of  one  block,  the  flow  states  from  the 
prior Runge-Kutta time step in the adjacent blocks are used, instead of
the  most  recent  flow  states.  Implementing  a  different  solution  procedure 
will  generally  change  both  convergence  and  stability,  but  should  result  in 
the  same  final  solution.  However,  in  order  to  allow  a  fast  evaluation  of  the 
block-loop  parallelization  of  ENSOLV,  a  simplified  implementation  of  the 
Jacobi  algorithm  was  used,  resulting  in  a  slightly  different  final  solution  (in 
particular  near  block  interfaces)  [3].  Results  of  the  serial  ENSOLV  code  using 
this  implementation  of  the  Jacobi  algorithm,  can  be  found  in  Tables  5-14; 

2.  A  significant  increase  in  memory  usage  is  unavoidable;  computing  the  blocks 
in  parallel  means  that  each  processor  needs  its  own  scratch  arrays.  For  all 
benchmark  cases,  except  the  single  block  benchmark  case  02,  the  memory 
size  is  approximately  doubled  when  run  on  eight  processors; 

3. Since blocks differ in the number of grid points, the model used, boundary
conditions applied etc., a load balancing problem may occur. Implementing
a load balancing, or task allocation, tool will improve the load balance
(Section 2.2);

4.  The  maximum  speed-up  that  can  be  obtained,  is  limited  to  the  number  of 
blocks  used,  if  the  number  of  blocks  is  less  than  the  number  of  processors. 
Also,  if  a  case  has  one  large  block  and  many  small  blocks,  the  maximum 
speed-up  is  limited  by  the  work  load  of  the  large  block. 
The block-loops were parallelized by splitting each single loop into a double
loop: the outer loop runs over the processors and the inner loop over the blocks
assigned to that processor by the task allocation tool (Section 2.2). The outer
loops were parallelized by inserting *odir directives, which are recognized only
by the NEC Fortran compiler, so the code remains portable. No message passing
code is necessary, since the parallelization takes place on a shared memory
computer. The NEC SX-4 preprocessor then generates the parallel code.
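The double-loop structure can be illustrated with a small Python sketch. It is not the ENSOLV implementation, which relies on compiler directives in Fortran; here Python threads stand in for the processors, and relax_block is a hypothetical placeholder for the work done on one block.

```python
from concurrent.futures import ThreadPoolExecutor


def relax_block(block_id):
    """Hypothetical stand-in for one relaxation of a single block."""
    return block_id * block_id


def run_block_loop(allocation):
    """Run the block loop; allocation[p] lists the blocks of processor p."""

    def inner_loop(blocks):
        # inner loop: the blocks assigned to one processor, in order
        return [relax_block(b) for b in blocks]

    with ThreadPoolExecutor(max_workers=len(allocation)) as pool:
        # outer loop over the processors, executed concurrently
        per_processor = list(pool.map(inner_loop, allocation))

    results = {}
    for blocks, values in zip(allocation, per_processor):
        results.update(zip(blocks, values))
    return results
```

For example, run_block_loop([[0, 3], [1], [2, 4]]) processes five blocks on three simulated processors and returns one result per block.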

2.2  Integration  of  parallel  ENSOLV 

The  code  was  integrated  into  a  system,  including  tools  for  task  estimation,  task 
allocation  and  speed-up  estimation.  The  current  work  was  carried  out  by  oper¬ 
ating  this  system  through  a  specific  working  environment,  ISNaS  [6],  where  the 
calculations  can  be  started  by  simple  drag-and-drop  actions. 


Task  estimation  Initially,  the  work  load,  or  the  weight  for  each  block  was  set 
equal  to  the  number  of  grid  points.  This  is  reasonable  under  the  assumption  that 
the  work  in  a  block  is  proportional  only  to  the  number  of  grid  points,  and  all 
blocks  are  active  in  all  parallel  parts  of  the  code.  With  ENSOLV,  this  assumption 
proved  to  be  incorrect;  if  two  blocks  have  the  same  number  of  grid  points,  but 
not  the  same  ordering  of  their  dimensions  in  the  grid,  their  work  loads  can  be 
different,  due  to  a  difference  in  vectorization  performance. 

The  present  task  estimation  tool  performs  (at  least)  one  iteration  of  the 
block-loop  parallel  ENSOLV  code,  including  timing-commands.  The  work  load 
for  each  block  is  then  set  equal  to  the  time  it  spends  in  the  block-loops. 


Task allocation In order to improve the load balance, a task allocation tool
was implemented. This task allocation tool is a stand-alone partitioning tool,
based on Ozturan's algorithm [7], adapted for shared memory machines [12]. The
algorithm starts from an existing partitioning of the blocks. It then relocates
blocks until a (theoretically) satisfactory load balance is reached, or no
further improvement is possible.
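The relocation idea can be sketched as follows. This is a simple greedy heuristic written for illustration only and is not the algorithm of [7] or its adaptation in [12]: starting from a given partitioning, it repeatedly moves one block from the most loaded processor to the least loaded one, as long as the move lowers the larger of the two loads. All names are hypothetical.

```python
def rebalance(partition, weight, max_moves=1000):
    """Greedy relocation of blocks between processors.

    partition: list of lists of block ids, one list per processor.
    weight:    mapping from block id to its estimated work load.
    Returns the new partition and the resulting load per processor.
    """
    parts = [list(p) for p in partition]
    loads = [sum(weight[b] for b in p) for p in parts]
    for _ in range(max_moves):
        src = max(range(len(parts)), key=loads.__getitem__)
        dst = min(range(len(parts)), key=loads.__getitem__)
        best = None
        for b in parts[src]:
            # larger of the two loads if block b were moved from src to dst
            new_peak = max(loads[src] - weight[b], loads[dst] + weight[b])
            if new_peak < loads[src] and (best is None or new_peak < best[0]):
                best = (new_peak, b)
        if best is None:
            break  # no single move improves the balance any further
        b = best[1]
        parts[src].remove(b)
        parts[dst].append(b)
        loads[src] -= weight[b]
        loads[dst] += weight[b]
    return parts, loads
```

Such a heuristic stops at a local optimum, so the resulting maximum load can be somewhat above the best achievable one.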


Speed-up  estimation  An  estimation  of  the  maximal  attainable  speed-up  can 
be  made  following  task  estimation  and  task  allocation.  An  approximation  of  the 
parallel  part  of  the  code  can  be  obtained  by  adding  the  work  loads  for  all  blocks. 
We can now calculate the maximal attainable speed-up using Amdahl's law:

$$ S = \left( f \cdot \frac{\max_P W_P}{\sum_{P=1}^{N_p} W_P} + (1 - f) \right)^{-1} \qquad (1) $$

where $f$ equals the fraction representing the parallel part, $N_p$ equals the
number of processors and $W_P$ equals the work assigned to processor $P$.
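A numerical reading of this estimate is sketched below. It assumes that the parallel part scales with the ratio of the largest processor load to the total load, that is $S = (f \cdot \max_P W_P / \sum_P W_P + (1 - f))^{-1}$; the function name is hypothetical.

```python
def max_speedup(f, work):
    """Amdahl-type speed-up estimate with load imbalance.

    f:    fraction of the run time spent in the parallel part (0..1).
    work: work load assigned to each processor.
    """
    total = sum(work)
    return 1.0 / (f * max(work) / total + (1.0 - f))
```

With perfectly balanced loads on four processors and f = 1 this gives 4; with all work on one processor it gives 1, whatever the number of processors.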
3 Settings for evaluation of parallel ENSOLV

In  this  section,  the  characteristics  of  the  benchmark  cases  are  given.  The  tools  in 
the  parallel  ENSOLV  system  are  used  to  calculate  maximal  attainable  speed-ups. 


3.1 Characteristics of the benchmark cases

For  the  performance  tests  on  the  NEC  SX-4,  a  set  of  test  problems  has  been 
defined  [5].  The  characteristics  of  these  benchmark  cases  can  be  found  in  Table  1. 
In  Fig.  1,  the  benchmark  cases  are  identified  by  configuration  and  number  of 
blocks. 


Fig. 1. Identification of benchmark cases by configuration and number of blocks
(horizontal axis: benchmark case)


3.2  Maximal  attainable  speed-ups 

In  Tables  2  and  3,  the  task  allocations  calculated  by  the  task  allocation  tool, 
discussed  in  Section  2.2,  can  be  found.  The  work  loads  for  benchmark  case  05, 
measured  by  using  one  iteration  only,  were  relatively  small.  This  can  lead  to 
inaccuracies, e.g. when calculating the fraction $f$ representing the parallel part.
In  order  to  reduce  inaccuracies,  the  calculations  for  benchmark  case  05  were 
done  for  the  full  500  iterations.  In  Table  4,  the  maximal  attainable  speed-up 
calculated  with  Equation  1  can  be  found. 

It is expected that only for benchmark case 02, a single block case, the
block-loop parallelization will lead to significantly worse speed-up, compared
to low-level DO-loop parallelization. For all other benchmark cases, block-loop
parallelization is expected to lead to an improvement in speed-up (Fig. 2).
Fig. 2. Speed-up results for eight processors; estimated block-loop versus measured
low-level DO-loop


4  Results 

The block-loop parallelization results for all ten benchmark cases, for 1-, 4- and
8-processor  runs,  are  shown  in  Tables  5-14. 

In  Tables  5-14,  the  following  definitions  are  used: 

-  The  Parallelization  Overhead  is  defined  as  the  ratio  of  the  real  time  needed 
by  the  parallel  version  run  on  1  processor  and  the  real  time  needed  by  the 
serial  version; 

-  The  Speed-up  for  N  processors  is  defined  as  the  ratio  of  the  real  time  of  the 
serial  version  and  the  real  time  of  the  parallel  version  run  on  N  processors; 

-  The  Memory  Overhead  is  defined  as  the  ratio  of  the  amount  of  memory 
needed  by  parallel  ENSOLV  on  N  processors  and  the  amount  of  memory 
needed  by  the  serial  version. 
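The three definitions translate directly into ratios. The helper names below are hypothetical and the numbers used afterwards are made-up illustration values, not measurements from the tables.

```python
def parallelization_overhead(t_parallel_1proc, t_serial):
    """Real time of the parallel version on 1 processor over the serial time."""
    return t_parallel_1proc / t_serial


def speed_up(t_serial, t_parallel_n):
    """Real time of the serial version over the parallel time on N processors."""
    return t_serial / t_parallel_n


def memory_overhead(mem_parallel_n, mem_serial):
    """Memory of the parallel version on N processors over the serial memory."""
    return mem_parallel_n / mem_serial
```

For instance, a serial run of 100 s, a 1-processor parallel run of 110 s and an 8-processor run of 20 s would give a parallelization overhead of 1.1 and a speed-up of 5.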

All  real  time  results  are  timings  of  the  iteration  part  of  the  solver,  output  to  the 
ENSOLV  output  file  OUT. 

In  the  following  sections,  the  speed-up,  memory  usage  and  execution  cost 
are compared to low-level DO-loop parallelization results. Not all the results of
low-level DO-loop parallelization are listed here; the reader is referred to [11].


4.1  Speed-up  results 

For all benchmark cases, except the single block benchmark case 02, block-loop
parallelization  shows  better  performance  in  terms  of  speed-up,  compared  to  low- 
level  DO-loop  parallelization. 

The remaining differences between speed-up estimations and measurements are
attributed to the fact that the task estimation tool uses only one iteration.
For benchmark case 05 the full 500 iterations were used, and the differences
are minimal. The required speed-up of 4.8 for eight processors, defined by the
ENSOLV user group, is attained by seven of the ten benchmark cases (Fig. 3).
Fig. 3. Speed-up results for eight processors; measured block-loop versus estimated
block-loop  and  measured  low-level  DO-loop 


4.2  Memory  usage 

As  expected,  the  memory  usage  increases  considerably  for  all  benchmark  cases, 
except  the  single  block  benchmark  case  02  (Tables  5-14).  Of  course,  the  memory 
usage  does  not  further  increase  when  the  number  of  processors  is  larger  than 
the  number  of  blocks.  Benchmark  case  10  shows  the  largest  increase  in  memory 
usage.  For  all  benchmark  cases,  the  memory  usage  was  smaller  than  the  maximal 
available  memory  on  the  NEC  SX-4. 


4.3  Execution  cost 

The execution cost for block-loop parallelization is considerably lower than
for low-level DO-loop parallelization for all benchmark cases, except for the
single block benchmark case 02.

For eight of the ten benchmark cases, the cost of the parallel execution of
ENSOLV on eight processors is equal to or less than the cost of serial execution
of ENSOLV (Fig. 4). For the large memory benchmark case 07, the cost of the
parallel runs is considerably lower than the cost of the serial run. This is due
to the construction of the SRU formula [13].
Fig. 4. Ratio of execution cost on eight processors and cost of sequential execution;
measured  block-loop  versus  measured  low-level  DO-loop 


5  Conclusions  and  future  work 

Block-loop  parallelization  has  been  used  for  parallelizing  the  multi-block  Navier- 
Stokes  flow  solver  ENSOLV.  The  parallel  code  was  integrated  into  a  system, 
including  tools  for  task  estimation,  task  allocation  and  speed-up  estimation. 
Future  users  will  be  able  to  operate  this  system  through  a  specific  working 
environment,  ISNaS  [6],  where  the  calculations  can  be  started  by  simple  drag- 
and-drop  actions. 

The block-loop parallelized code was tested on ten benchmark cases. The
performance  was  measured  in  terms  of  speed-up,  memory  usage  and  execution 
cost,  and  compared  to  the  performance  of  the  low-level  DO-loop  parallelized 
code  implemented  earlier. 

All  benchmark  cases,  except  the  single  block  benchmark  case  02,  show  better 
performance  in  terms  of  speed-up  compared  to  low-level  DO-loop  parallelization. 
For seven of the ten benchmark cases, the speed-up for eight processors is higher
than the user-required value of 4.8.

For all benchmark cases, except the single block benchmark case 02, memory
usage increases considerably when using block-loop parallelization instead of
low-level DO-loop parallelization, as was foreseen.

The block-loop parallelization gives better or comparable performance in
terms of execution cost than the low-level DO-loop parallelization for all
benchmark cases, except the single block benchmark case 02. For six of the ten
benchmark cases, the execution cost for parallel runs is lower than or
comparable to the execution cost for the sequential run.

Based on the results, it was decided not to implement a single parallelization
approach combining both previously applied parallelization strategies (low-level
DO-loop parallelization for larger blocks, block-loop parallelization for several
smaller blocks).
Table 1. Characteristics of the benchmark cases (w=wing, w/b=wing-body,
w/b/n=wing-body-nacelle, w/b/n/p=wing-body-nacelle-pylon, BL=Baldwin-Lomax,
CS=Cebeci-Smith, JK=Johnson-King)

Case   Name      Config.     2D/3D  Blocks  Mcells  Multigrid  Euler/TLNS  Tur. Mod.  Iter.
01     RAE2822   aerofoil    2D          8   0.010      3      T(j)        BL          400
02     Delta     w           3D          1   0.369      3      T(k)        BL          200
03     AS28g     w/b/n       3D         62   1.556      2      E           -           500
04     Onera M6  w           3D          4   0.786      4      T(j)        CS           80
05     F16       a/c         3D         57   2.084      1      E           -           500
06     F16       a/c+stores  3D         86   2.084      2      E           -           360
07     VTP4      w/b         3D         38   6.636      4      T(j)        BL          100
08     VTP4      w/b/n       3D        105   1.455      3      T(j)        JK          100
09     Model 10  w/b/n/p     3D        106   2.211      3      T(i,j,k)    CS          100
10     Duprin    w/b/n/p     3D         21   0.577      2      E           -           100

Table 2. Task allocations for four processors, with Wp equal to the work load of
processor P, the maximum given in bold (? = entry illegible in the source)

case   01    02     03     04     05       06     07      08      09      10
W1     ?     36.20  11.35  29.76  ?        18.19  ?       42.50   57.12   5.25
W2     ?     0      11.27  29.69  715.86   ?      ?       42.33   57.69   4.64
W3     0.17  0      ?      17.43  726.24   17.97  112.44  42.49   57.63   5.34
W4     0.11  0      11.44  17.41  717.94   18.21  111.62  42.29   57.48   4.36
total  ?     ?      ?      ?      2867.34  ?      ?       ?       229.92  ?

Table 3. Task allocations for eight processors, with Wp equal to the work load of
processor P, the maximum given in bold (? = entry illegible in the source)

case   01  02     03     04     05       06     07      08      09      10
W1     ?   36.20  5.44   29.76  354.89   9.00   56.12   20.50   28.79   2.57
W2     ?   0      5.49   29.69  383.44   9.19   55.27   20.31   ?       2.70
W3     ?   0      ?      17.43  355.95   8.96   54.23   20.45   28.89   2.08
W4     ?   0      5.47   17.41  352.33   8.99   56.11   21.13   28.75   2.21
W5     ?   0      5.54   ?      353.48   8.91   58.24   20.32   28.61   2.59
W6     ?   0      5.46   ?      351.00   9.07   54.72   20.27   28.54   2.76
W7     ?   0      5.41   ?      366.18   9.02   56.89   26.00   28.47   2.49
W8     ?   0      5.45   ?      350.07   9.13   54.01   20.63   28.87   2.19
total  ?   ?      45.36  94.29  2867.34  72.27  445.59  169.61  229.92  ?
Table 4. Maximal attainable speed-ups for four and eight processors, with f the
fraction representing the parallel part of the code

case  01      02      03      04      05      06      07      08      09      10
f     0.9394  0.9981  0.9908  0.9890  0.9949  0.9922  0.9932  0.9957  0.9962  0.9894
S4    3.08    1.00    3.86    3.09    3.89    3.88    3.88    3.94    3.94    3.57
S8    3.22    1.00    6.09    3.09    7.24    7.47    7.32    6.37    7.73    6.67
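
The entries of Table 4 appear to be consistent with an Amdahl-type bound in
which the parallel fraction f is weighted by the load imbalance of the task
allocation, S_P = 1 / ((1 - f) + f * max_p(W_p) / sum_p(W_p)). This formula is an
assumption inferred from the tabulated numbers, not a definition quoted from
the text, and the function name `max_speedup` is hypothetical. A minimal Python
sketch that checks two legible entries:

```python
def max_speedup(f, loads):
    """Amdahl-type bound weighted by load imbalance (assumed formula).

    f     -- fraction of the code that runs in parallel
    loads -- work load W_p assigned to each processor
    """
    total = sum(loads)
    if total <= 0:
        raise ValueError("loads must sum to a positive value")
    # The most heavily loaded processor determines the parallel run time.
    imbalance = max(loads) / total
    return 1.0 / ((1.0 - f) + f * imbalance)


# Case 04, four processors (work loads taken from Table 2).
case04_p4 = max_speedup(0.9890, [29.76, 29.69, 17.43, 17.41])

# Case 08, eight processors (work loads taken from Table 3).
case08_p8 = max_speedup(
    0.9957, [20.50, 20.31, 20.45, 21.13, 20.32, 20.27, 26.00, 20.63]
)

print(f"{case04_p4:.2f} {case08_p8:.2f}")
```

Under this assumption the two values round to 3.09 and 6.37, matching the
S4 entry for case 04 and the S8 entry for case 08. The single-block case 02
yields 1.00 for any number of processors, because all work sits on one
processor.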

Table 5. Parallel performance for case 01, 400 iterations

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   118          -         1.00      237     24          -         1475
1            parallel     124          1.05      0.95      226     25          1.04      1553
4            parallel     52           -         2.27      533     40          1.67      2543
8            parallel     55           -         2.15      508     54          2.25      5266

Table 6. Parallel performance for case 02, 200 iterations

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   717          -         1.00      485     195         -         11297
1            parallel     865          1.21      0.83      436     212         1.09      12121
4            parallel     712          -         1.01      484     203         1.04      14052
8            parallel     962          -         0.75      364     212         1.09      33396

Table 7. Parallel performance for case 03, 500 iterations (? = entry illegible in
the source)

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   2256         -         1.00      666     249         -         38296
1            parallel     2327         1.03      0.97      652     266         1.07      ?
4            parallel     603          -         3.74      2488    376         1.51      33385
8            parallel     456          -         4.95      3291    465         1.87      35373

Table  8.  Parallel  performance  for  case  04,  80  iterations 


[Table body illegible in the source.]
Table 9. Parallel performance for case 05, 500 iterations (? = entry illegible in
the source)

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   2873         -         1.00      644     241         -         48331
1            parallel     2882         1.00      1.00      643     241         1.00      48412
4            parallel     768          -         3.74      2402    357         1.48      40759
8            parallel     405          -         7.09      4541    502         ?         40594

Table 10. Parallel performance for case 06, 360 iterations

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   2600         -         1.00      618     282         -         45689
1            parallel     2648         1.02      0.98      608     300         1.06      40254
4            parallel     684          -         3.80      2324    434         1.54      38595
8            parallel     443          -         5.87      3584    584         2.07      39109

Table 11. Parallel performance for case 07, 100 iterations (? = entry illegible in
the source)

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   4430         -         1.00      621     859         -         129283
1            parallel     4593         1.04      0.96      617     859         ?         130170
4            parallel     1270         -         3.49      2161    1555        1.81      91093
8            parallel     717          -         6.18      3825    1944        2.26      82421

Table 12. Parallel performance for case 08, 100 iterations

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   1676         -         1.00      375     211         -         27197
1            parallel     1752         1.05      0.96      362     228         1.08      25539
4            parallel     560          -         2.99      1117    281         1.33      24390
8            parallel     325          -         5.16      1923    346         1.64      26005
Table 13. Parallel performance for case 09, 100 iterations (? = entry illegible in
the source)

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   2294         -         1.00      395     285         -         ?
1            parallel     2326         1.01      0.99      389     302         1.06      ?
4            parallel     681          -         3.37      1320    357         1.25      ?
8            parallel     396          -         5.79      2269    436         1.53      ?

Table 14. Parallel performance for case 10, 100 iterations

#processors  sequential   execution    parallel  speed-up  MFLOPS  memory      memory    SRU
             or parallel  (real) time  overhead                    usage (MB)  overhead
1            sequential   198          -         1.00      570     104         -         2784
1            parallel     195          0.98      1.02      571     104         1.00      2776
4            parallel     64           -         3.09      1715    183         1.76      3028
8            parallel     31           -         6.39      3487    249         2.39      3135

Abstract. This paper presents an experimental validation of the makespan
improvements of two scheduling algorithms: a greedy construction algorithm
and a tabu search based algorithm. Synthetic parallel executions
were performed using the scheduled graph costs. These synthetic executions
were performed on a real parallel machine (IBM SP). The estimated
and observed response time improvements are very similar, indicating
the low impact of system overhead on makespan improvement estimation.
This guarantees a reliable cost function for static scheduling algorithms
and confirms the better results of the tabu search metaheuristic
applied to scheduling problems.


1  Introduction 

Parallel applications with regular and well-known behavior, where task execution
time estimates are fairly reliable, are suited to static task scheduling (as opposed
to dynamic scheduling, performed during the execution of the application).
This is the case for the great majority of scientific applications. For these
applications, the static scheduling algorithm is executed once, before the execution of
the parallel program, which is then run several times according to the
previously obtained schedule. Consequently, even if the scheduling algorithm is a
costly procedure, this cost is amortized over the numerous executions
of the parallel application, i.e. the obtained schedule is re-applied repeatedly.

Static task scheduling is, thus, performed based on estimated data about
the parallel application and the system architecture. Therefore, a realistic
performance evaluation of a task scheduling algorithm can only be fully accomplished
if practical results are also considered. In this sense, the present work analyzes
the quality of greedy and tabu search task scheduling algorithms by comparing
estimated deterministic results with the observed makespan of several synthetic
parallel applications executing on real heterogeneous parallel machines
following the static schedule previously determined. The following section presents the
scheduling system model. In Section 3, both the greedy and tabu search algorithms
are described. In Section 4, we report the overall experimentation, including: (i)
a description of the testing platform and problem instances considered during
the testing phase; (ii) the most significant numerical results; and (iii) the
comparative solution quality analysis according to different parameters. Section 5
presents some brief concluding remarks.

2  The  Scheduling  Model 

A parallel application $\Pi$ with a set of $n$ tasks $T = \{t_1, \ldots, t_n\}$ and a
heterogeneous multiprocessor system composed of a set of $m$ interconnected processors
$P = \{p_1, \ldots, p_m\}$ can be represented by a task precedence graph $G(\Pi)$ and an
$n \times m$ matrix $\mu$, where $\mu_{kj} = \mu(t_k, p_j)$ is the estimated execution time
of a task $t_k \in T$ at processor $p_j \in P$. Each processor can run one task at a time, all
tasks can be executed by any processor, and processors are said to be uniform
in the sense that $\mu_{ki}/\mu_{kj} = \mu_{li}/\mu_{lj}$, $\forall t_k, t_l \in T$,
$\forall p_i, p_j \in P$. This implies that processors
may be ranked according to their processing speeds. In a framework with one
single faster (heterogeneous) processor, the heterogeneity may be expressed by
a unique parameter called processor power ratio, $PPR$, which is the ratio
between the processing speed of the fastest processor and that of the remaining
ones (those in the subset of homogeneous processors). Thus, an instance of our
scheduling problem is characterized by the workload and parallel system models.

Given a solution $s$ for the scheduling problem, a processor assignment
function is defined as the mapping $A_s : T \to P$. A task $t_k$ is said to be assigned
to processor $p_j \in P$ in solution $s$ if $A_s(t_k) = p_j$. The task scheduling problem
can then be formulated as the search for an optimal assignment of the set of
tasks onto that of the processors, in terms of the makespan $c(s)$ of the parallel
application (cost of the solution $s$), i.e. the completion time of the last task being
executed. At the end of the scheduling process, each processor ends up with an
ordered list of tasks that will run on it as soon as they become executable.
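
The system model above can be illustrated with a minimal Python sketch. It
assumes that the single faster processor is stored at index 0, that task
service demands are given as execution times on a homogeneous processor, and
that uniformity means the execution-time ratio between any two processors is
the same for every task. The names `execution_times` and `is_uniform` are
hypothetical and do not come from the paper.

```python
def execution_times(demands, m, ppr):
    """Build the n x m matrix mu of estimated execution times.

    demands -- execution time of each task on a homogeneous processor
    m       -- number of processors; index 0 is the faster one (assumption)
    ppr     -- processor power ratio (speed of processor 0 / speed of others)
    """
    if m < 1 or ppr <= 0:
        raise ValueError("need m >= 1 and ppr > 0")
    return [[d / ppr if j == 0 else d for j in range(m)] for d in demands]


def is_uniform(mu, tol=1e-9):
    """Check that mu[k][i] / mu[k][j] is the same for every task k."""
    base = mu[0]
    for row in mu:
        for j in range(len(row)):
            # Cross-multiplied form of row[j] / row[0] == base[j] / base[0].
            if abs(row[j] * base[0] - row[0] * base[j]) > tol:
                return False
    return True


mu = execution_times([4.0, 6.0], m=3, ppr=2.0)
print(mu)              # each task takes half the time on processor 0
print(is_uniform(mu))  # a single PPR always yields uniform processors
```

With a PPR of 2, every task takes half as long on processor 0 as on the
remaining processors, so the rows of the matrix are proportional and the
processors can be ranked by speed.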

3  Heuristic  Task  Scheduling  Algorithms 

We consider two algorithms in this work, namely a greedy algorithm called
DES+MFT and a parallel tabu search algorithm, here referred to as TSpar.
Although both of them are heuristic, they have fundamentally different
characteristics. The former is a construction algorithm, which iteratively assigns tasks
to processors based on heuristic criteria, taking into account the static
information of the system model. TSpar, on the other hand, is a synchronous
parallel implementation of a tabu search metaheuristic algorithm, which guides
an aggressive local search procedure over the task scheduling solution space.

3.1 The DES+MFT Greedy Algorithm

DES+MFT stands for Deterministic Execution Simulation with Minimum Finish
Time [4]. This algorithm iteratively schedules tasks in a partial order according
to the simulated execution of the parallel application (DES), based on the
estimated task execution times, while scheduling decisions are made according to the
minimum finish time (MFT) for each "schedulable" task. Figure 1 describes
DES+MFT in a procedural scheme. In this scheme, the clock variable
measures the evolution of the execution. At the end of this procedure, $c(s) = clock$
is the cost of the obtained solution, i.e., the makespan of the parallel
application when submitted to the DES+MFT processor assignment. At each iteration,
certain tasks are scheduled to processors, building an ordered list of tasks
associated with each processor. This would be the actual execution order if tasks were to
be executed in an ideal system with the estimated execution times. During this
deterministic execution simulation, each task $t_k \in T$ assumes one of the following
states at each time instant: non-executable, executable, executing, executed. At
the same time, each processor $p_j \in P$ alternates between two different states:
free and busy. A processor $p_j$ is said to be busy if it has a task in the executing
state allocated to it.

It should be noticed that DES+MFT, like most greedy algorithms, does not
go back to re-evaluate the scheduling decisions taken in previous iterations.
This means that, despite its "look-ahead" feature, it is not capable of
changing scheduling decisions made in previous iterations, which were based on
snapshots of the simulated execution. Consequently, these scheduling decisions
depend on how strongly tasks are tied through precedence relations, because they
determine the order in which tasks may possibly be scheduled. In contrast, the
TSpar algorithm, starting from the initial solution obtained by the DES+MFT
algorithm, evaluates many other possible assignments, which eventually improve
the makespan of the parallel application, as we can see in the following section.
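
The greedy procedure described in this section can be sketched in Python as
follows. This is an illustrative reading of the description, not the authors'
ANSI C implementation: the function name `des_mft`, the data layout
(`preds` maps a task to its predecessors, `mu[t][p]` is the estimated time of
task t on processor p) and the tie-breaking by iteration order are all
assumptions.

```python
def des_mft(tasks, preds, procs, mu):
    """Greedy DES+MFT-style scheduling; returns (assignment, order, makespan)."""
    state = {t: "non-executable" for t in tasks}
    finish, assign = {}, {}
    avail = {p: 0.0 for p in procs}    # time at which each processor is free
    order = {p: [] for p in procs}     # ordered task list per processor
    clock = 0.0
    while any(s != "executed" for s in state.values()):
        # Tasks whose predecessors have all finished become executable.
        for t in tasks:
            if state[t] == "non-executable" and all(
                state[q] == "executed" for q in preds.get(t, ())
            ):
                state[t] = "executable"
        candidates = [t for t in tasks if state[t] == "executable"]
        while candidates:
            # Pair (task, processor) with the minimum estimated finish time,
            # counting the waiting time for processors that are still busy.
            t, p = min(
                ((t, p) for t in candidates for p in procs),
                key=lambda tp: max(clock, avail[tp[1]]) + mu[tp[0]][tp[1]],
            )
            candidates.remove(t)
            if avail[p] <= clock:      # processor is free: start the task now
                state[t] = "executing"
                assign[t] = p
                order[p].append(t)
                finish[t] = clock + mu[t][p]
                avail[p] = finish[t]
            # Otherwise the task waits for its preferred (busy) processor.
        executing = [t for t in tasks if state[t] == "executing"]
        if not executing:
            raise ValueError("precedence graph contains a cycle")
        # Advance the simulated clock to the next task completion.
        clock = min(finish[t] for t in executing)
        for t in executing:
            if finish[t] == clock:
                state[t] = "executed"
    return assign, order, clock


tasks = ["a", "b", "c", "d"]
preds = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}   # diamond-shaped graph
procs = ["p0", "p1"]                                # p0 is twice as fast
mu = {t: {"p0": 1.0, "p1": 2.0} for t in tasks}
assign, order, cost = des_mft(tasks, preds, procs, mu)
print(order, cost)
```

On this small diamond graph with a PPR of 2, the sketch places every task on
the faster processor: task c prefers to wait for p0 rather than start
immediately on p1. Once made, a decision is never revisited, which is the
greedy behavior described above.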

3.2  The  Parallel  Tabu  Search  Algorithm 

To describe the TSpar algorithm, we first consider a general combinatorial
optimization problem $(P)$ formulated as to

minimize $c(s)$

subject to $s \in S$,

where $S$ is a discrete set of feasible solutions. Local search approaches for solving
problem $(P)$ are based on search procedures in the solution space $S$ starting from
an initial solution $s_0 \in S$. At each iteration, a heuristic is used to obtain a new
solution $s'$ in the neighborhood $N(s)$ of the current solution $s$, through slight
changes in $s$. A move is an atomic change which transforms the current solution,
$s$, into one of its neighbors, say $\bar{s}$. Thus, $movevalue = c(\bar{s}) - c(s)$ is the difference
between the value of the cost function after the move, $c(\bar{s})$, and the value of the
cost function before the move, $c(s)$. Every feasible solution $\bar{s} \in N(s)$ is evaluated
according to the cost function $c(\cdot)$, which is eventually optimized. The current
solution moves smoothly towards better neighbor solutions, enhancing the best
obtained solution $s^*$.

Tabu search [1, 2] may be described as a higher level heuristic for solving
minimization problems, designed to guide other hill-descending heuristics in order
DES+MFT algorithm
begin
    clock ← 0
    state(p_j) ← free  ∀p_j ∈ P
    start(t_k), finish(t_k) ← 0  ∀t_k ∈ T
    while (∃t_k ∈ T | state(t_k) ≠ executed) do
    begin
        for (each t_k ∈ T | state(t_k) = executable and p_j ∈ P) do
            obtain the pair (t_i, p_l) with the minimum finish time
            if (state(p_l) = free) then
            begin
                state(t_i) ← executing
                A_s(t_i) ← p_l
                state(p_l) ← busy
                start(t_i) ← clock
                finish(t_i) ← start(t_i) + μ(t_i, p_l)
            end
        Let i be such that finish(t_i) = min {finish(t_k) | state(t_k) = executing}
        clock ← finish(t_i)
        for (each t_k ∈ T | state(t_k) = executing and finish(t_k) = clock) do
        begin
            state(t_k) ← executed
            state(A_s(t_k)) ← free
        end
    end
    c(s) ← clock
end


Fig. 1. DES+MFT algorithm description.


to escape from local optima. Thus, tabu search is an adaptive search technique
that aims at intelligently exploring the solution space in search of good, hopefully
optimal, solutions. Its learning capability means that tabu search gathers
richer knowledge about the instance of the problem to be solved than that
generated in other iterative algorithms. In the case of the task scheduling problem
considered in this paper, the cost of a solution is given by its makespan, i.e.,
the overall execution time of the parallel application. The neighborhood $N(s)$
of the current solution $s$ is the set of all solutions differing from it by only a
single assignment. If $\bar{s} \in N(s)$, then there is only one task $t_i \in T$ for which
$A_{\bar{s}}(t_i) \neq A_s(t_i)$. Each move may be characterized by a simple representation
given by $(A_s(t_i), t_i, p_l)$, provided that the position task $t_i$ will occupy in the task list
of processor $p_l$ is uniquely defined. If the best move takes the current solution
$s$ to a best neighbor solution $s'$ that degrades the cost function, i.e. $c(s') > c(s)$,
then the reverse move must be prohibited during a certain number of iterations
(tabu tenure) in order to avoid cycling. However, there are situations in which a
recently prohibited move, if applied after some iterations, will provide a better
solution than the best one found by the algorithm so far, despite its prohibited
status. In these cases, an aspiration criterion is used to override this
prohibition, enabling the move to be executed. In [6] and [7] the reader will find a more
detailed description of the tabu search algorithm.
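
The mechanics described above (single-assignment moves, tabu tenure and
aspiration) can be sketched sequentially in Python. This is a generic
illustration, not the TSpar implementation: the function name `tabu_search`,
the parameters `tenure` and `max_idle`, and the toy cost function (independent
tasks on identical processors, cost equal to the largest processor load) are
assumptions made only for the example.

```python
def tabu_search(initial, procs, cost, tenure=3, max_idle=20):
    """Minimize cost(assignment) by moving one task at a time."""
    current = dict(initial)
    best, best_cost = dict(current), cost(current)
    tabu = {}          # (task, processor) -> last iteration of the prohibition
    it = idle = 0
    while idle < max_idle:
        it += 1
        cur_cost = cost(current)
        move = None
        for t, src in list(current.items()):
            for p in procs:
                if p == src:
                    continue
                current[t] = p             # tentatively apply the move
                c = cost(current)
                current[t] = src           # undo it
                is_tabu = tabu.get((t, p), 0) >= it
                # Aspiration: a tabu move is allowed only if it beats the best.
                if is_tabu and c >= best_cost:
                    continue
                if move is None or c < move[0]:
                    move = (c, t, src, p)
        if move is None:                   # every move is prohibited
            break
        c, t, src, p = move
        current[t] = p
        if c > cur_cost:
            # Cost got worse: prohibit the reverse move for `tenure` iterations.
            tabu[(t, src)] = it + tenure
        if c < best_cost:
            best, best_cost, idle = dict(current), c, 0
        else:
            idle += 1
    return best, best_cost


durations = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 2}   # toy instance
procs = ["p0", "p1"]


def max_load(assignment):
    """Toy cost: largest total duration assigned to any processor."""
    loads = {p: 0 for p in procs}
    for t, p in assignment.items():
        loads[p] += durations[t]
    return max(loads.values())


serialized = {t: "p0" for t in durations}   # all tasks on one processor
best, best_cost = tabu_search(serialized, procs, max_load)
print(best, best_cost)
```

Starting from a fully serialized assignment of cost 12, the search reaches the
balanced assignment of cost 6 and then keeps exploring non-improving moves
until `max_idle` iterations pass without a new best solution. In the
scheduling problem of this paper, the cost function would instead be the
makespan of the schedule under the precedence constraints.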

The promising results obtained through parallelization led to the
possibility of more effectively evaluating the solution quality of the proposed tabu search
task scheduling algorithm using a parallel implementation. Considering both
sequential and parallel implementations, solution quality was analyzed according to
different parameters and strategies, which are needed to fully specify the tabu search
algorithm, with a certain variety of application model parameters (such as task
graph structures, number of tasks, serial fraction and task service demands)
and system configurations (such as number of processors and architecture
heterogeneity measured by the processor power ratio). It was shown that the tabu
search algorithm obtained better results, i.e. shorter completion times for parallel
applications, improving by up to 40% the makespan obtained by the DES+MFT
algorithm, which in fact is the most appropriate greedy algorithm previously
published in the literature [6,8]. We have used the MS-MP parallel version to
carry out the experimentation reported here, because it has demonstrated the
best speedup results in most of the studied cases [7].

4  Experimental  Results 

In this section, we present some experimental results obtained from the execution
of synthetic parallel programs scheduled with both the greedy and tabu search
algorithms. We first present some results derived from the estimated improvement
analysis of tabu search schedules over those generated by DES+MFT, which
provides the initial solution for the tabu search algorithm. The performance criterion
is the makespan (solution cost) estimated by both algorithms. We then
describe ANDES [3], a framework for performance evaluation using parallel
program models and synthetic programs. Finally, using this framework, we
compare execution times of synthetic parallel programs scheduled by the DES+MFT
and TSpar algorithms.

4.1  Estimated  Performance  Analysis 

DES+MFT and TSpar scheduling algorithms were implemented using ANSI
C and PVM (Parallel Virtual Machine) [9]. The schedule quality is estimated
based on the computed makespan. In other words, the makespan represents the
schedule cost, $c(\cdot)$, which is to be minimized.

One of the main goals is to achieve makespan reduction when changing from
the schedule produced by DES+MFT to the one produced by TSpar. Thus,
solution quality is measured by relative cost reduction, $\mathcal{R}$, computed as

$$\mathcal{R} = \frac{c(s_0) - c(s^*)}{c(s_0)}$$

where $s_0$ is the initial solution obtained by the greedy algorithm DES+MFT
and $s^*$ is the best solution found by the TSpar algorithm.
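
As a small worked example of this measure, the following Python sketch
computes the relative cost reduction from two makespans. The function name
and the numeric values are hypothetical and serve only as illustration.

```python
def relative_cost_reduction(c_s0, c_best):
    """Relative cost reduction R = (c(s0) - c(s*)) / c(s0).

    c_s0   -- makespan of the initial (greedy) schedule
    c_best -- makespan of the best schedule found by the search
    """
    if c_s0 <= 0:
        raise ValueError("the initial makespan must be positive")
    return (c_s0 - c_best) / c_s0


# Hypothetical makespans: 100 time units reduced to 70 time units.
reduction = relative_cost_reduction(100.0, 70.0)
print(f"{reduction:.0%}")
```

A value of zero means that the search did not improve on the initial greedy
schedule, which is the situation described below for very low and very high
PPR values.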

In [6], relative cost reduction values of up to 30% were obtained considering
applications modeled by diamond-shaped precedence graphs. In [8], new results
were presented considering other structures for the parallel applications. Part of
the ANDES benchmark was then used: other types of diamond-shaped graphs
(Diamonds and Diamond4), iterative graphs (FFT and PDE2), divide-and-conquer
strategies (Divconq), typical matrix computation structures (Gauss), and
master-slave models (MSS). The results of these experiments can be summarized
as follows:


- A parallel application is said to be serialized by a certain processor
assignment algorithm when all of its tasks are scheduled to one unique processor.
When the serial fraction ($F_s$) and/or the processor power ratio ($PPR$) are
very high, the best solution is usually obtained through the serialization of
the application over the heterogeneous processor, which has greater
processing capacity. This seems to be clear if we imagine two extreme cases: $F_s = 1$
or $PPR \to \infty$. In the first case, we face a totally serial application, which
must be executed on the heterogeneous processor ($F_s$ corresponds to the
serial fraction defined as the fraction of the total parallel execution time
when just one task is executing even if infinite processors were available).
In the latter case, the heterogeneous processor is able to execute any task
in infinitesimal time; consequently, serialization again gives the best
performance.

In certain circumstances, serialization will be performed by the DES+MFT
algorithm when there is still available parallelism to be exploited in the
parallel application. In these cases, the tabu search algorithm will start from a
serialized initial schedule, and will more easily be able to find different
assignments which greatly reduce the overall makespan of the application,
increasing the relative cost reduction.

- For very low and very high PPR values, low or null makespan improvements
are obtained. A low PPR value means a low degree of heterogeneity, and, in
this case, the greedy algorithm improvements are sufficient (it is suitable for
homogeneous configurations). At the other end of the heterogeneity range,
very high PPR values mean that serialization on the very fast processor is
the best solution. In these cases both the DES+MFT and TSpar algorithms
are able to serialize the application, so makespan improvements are not
observed;

- Between the two extremes of the PPR value range, we find a mountain-like
peak of improvements, culminating at the PPR that gives the best relative
performance achieved by the TSpar algorithm. This point is referred to as the
$PPR_{peak}$ point. The $PPR_{peak}$ point is highly dependent on the shape of
the input task graph. Groups of similar task graphs have a similar behavior.
For example, diamond-shaped graphs present a low $PPR_{peak}$ (around 5).
On the other hand, iterative graphs produce a smoother improvement
curve, with a higher $PPR_{peak}$ (around 20 or 30), depending on the size of the
task graph;

- The structure of the task graph is not the only critical factor in the relative quality
improvement analysis. The number of processors available for scheduling
assignments also influences the results. The relationship between solution quality
improvements and the number of processors varies depending on the
structure of the task graph. On one hand, the greater the number of
processors we have, the less heterogeneous the system becomes and thus a lower
relative cost reduction is achieved. However, a greater number of processors
also represents more available parallelism and therefore a greater number of
different scheduling possibilities arise.

Figure 3 presents some estimated relative cost reduction values computed
between the DES+MFT and TSpar algorithms. In [8], Porto et al. measured
improvements for discrete values of PPR (2, 5, 10, 20, ..., depending on the input).
Figure 2 presents a more detailed experiment, with a fine variation of PPR values
and number of processors, considering the Diamonds benchmark with 66 tasks.


[Plot: Relative Reduction for Diamonds with 66 tasks]


Fig. 2. Detailed relative cost reduction $\mathcal{R}$ versus PPR for Diamonds graph ($m$
corresponds to the number of processors to which the tasks are scheduled).


4.2  The  Experimental  Framework 

The ANDES Environment - ANDES [3] is a PVM-based parallel tool that
supports performance evaluation of parallel programs at the prediction level.
ANDES considers the existing complex overheads of parallel computers. This
is achieved through the use of synthetic parallel executions directly on the
parallel machine. In a synthetic parallel execution, the resources of the parallel
computer are used in a controlled way, but no code is generated. All the steps
from the interpretation of the graph-based parallel program and
parallel machine models to the synthetic execution on the target parallel machine
are automatically managed by ANDES. ANDES finally computes performance
metrics along the execution of that workload implemented according to mapping
and/or scheduling strategies. Synthetic execution was chosen as the performance
technique due to the easy control of parameters as well as the possibility of using
a real environment. The idea is to combine the best of model-based approaches
with the best of realistic parallel executions. ANDES has been used to refine
analytical and simulation analysis. With the current high availability of parallel
systems, the results of ANDES have proved to be precise and useful.

The  Parallel  System  -  ANDES  along  with  the  synthetic  parallel  programs 
were  executed  on  an  IBM  SP  multicomputer  composed  of  32  RS6000  RISC 
microprocessors  with  64  megabytes  of  RAM.  The  processors  are  interconnected 
by  a  high-speed  switch  (bidirectional  with  nominal  speed  of  80  megabytes  per 
second). 

The Benchmarks - In order to compare estimated and observed improvements
of the overall execution times of real parallel synthetic programs, we have used
the following benchmark (part of the ANDES package): (i) Diamonds with 66
tasks; (ii) FFT with 194 tasks; (iii) Gauss with 192 tasks; and (iv) Divconq with
46 tasks.

This benchmark picks representative task graphs from the ones studied in
[8]. Both small and larger task graphs are used. TSpar was executed using 4
processors of the IBM SP. The estimated quality of both TSpar and DES+MFT
algorithms is computed using a conventional C procedure for computing the
makespan of the task graphs, detailed in Figure 4 (very similar to the DES+MFT
description). The final value of clock is the actual makespan. Each graph
of the benchmark is scheduled to 2, 4, 8, and 16 processors.
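
Computing the makespan of a given schedule can be sketched as follows. This
Python illustration is not the C procedure of Figure 4: it assumes the
schedule is given as an ordered task list per processor, that `preds` maps
each task to its predecessors, that `mu[t][p]` holds the estimated execution
time, and that communication has zero cost. All names are hypothetical.

```python
def makespan(schedule, preds, mu):
    """Completion time of the last task under a fixed per-processor order."""
    finish = {}
    proc_free = {p: 0.0 for p in schedule}   # time each processor becomes free
    nxt = {p: 0 for p in schedule}           # next list position per processor
    remaining = sum(len(ts) for ts in schedule.values())
    while remaining:
        progressed = False
        for p, tasks in schedule.items():
            while nxt[p] < len(tasks):
                t = tasks[nxt[p]]
                if any(q not in finish for q in preds.get(t, ())):
                    break                    # a predecessor has not finished yet
                ready = max((finish[q] for q in preds.get(t, ())), default=0.0)
                start = max(proc_free[p], ready)
                finish[t] = start + mu[t][p]
                proc_free[p] = finish[t]
                nxt[p] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            raise ValueError("schedule order conflicts with the precedences")
    return max(finish.values(), default=0.0)


preds = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}          # diamond graph
mu = {t: {"p0": 1.0, "p1": 2.0} for t in "abcd"}           # p0 twice as fast
schedule = {"p0": ["a", "b", "d"], "p1": ["c"]}
print(makespan(schedule, preds, mu))
```

In this example task d has to wait for task c, which runs on the slower
processor, so the makespan is 4 time units, the same as for the fully
serialized schedule on the faster processor.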

The generated schedules are read by ANDES, which generates the synthetic
load to be interpreted by ANDES-Synth, the synthetic execution kernel.
Synthetic loads are then executed according to the given schedules.

In order to simulate heterogeneity, the size of synthetic loops corresponding
to tasks allocated to the faster processor is reduced by a factor corresponding
to the PPR itself. Thus, a PPR of 2 means that loops to be executed on the
heterogeneous processor are reduced by half. The scheduling algorithms consider
communications with zero overhead. This corresponds in ANDES to
communications of a single byte (in the IBM SP machine, such a message transmitted
through the switch incurs a latency of around 47.03 microseconds [5]).

Preliminary experiments were performed on an idle machine. The standard
deviation was always under 1% for 10 consecutive executions. Considering this
low degree of variability, we have performed measurements using a sample of size 5.
4.3 Results and Analysis

Figures 5, 6, 7, and 8 present, in the same plot, estimated and measured
relative cost reduction values. The chosen PPR value range includes, for all
plots, the highest relative cost reduction values achieved by TSpar. Differences
between estimated and observed improvements are under 5% for all experiments.

Our results demonstrate, through the similarity between estimated and observed
relative cost reduction values, that the makespan computation used in both
scheduling algorithms is in fact reliable. This computation is completely
deterministic. On the other hand, the observed execution times are definitely
non-deterministic due to the overhead from the operating system and the
communication subsystem. However, the execution times presented very low variability.
Therefore, this overhead does not significantly influence the experimental
execution times, i.e. the makespan algorithm proves to be very useful for static
scheduling decisions based on estimated data. Although intuitive, this conclusion
is not obvious and experiments were necessary to validate it.

Given a precise makespan computation, one important consequence is that the
tabu search improvements are real and significant. This was foreseen in
previous work, based on the estimated relative cost reduction values
between the DES+MFT and TSpar algorithms. In this paper, we demonstrate
that these improvements also occur in more realistic execution environments.

Another interesting result is that PPRpeak is not always the same. As a
matter of fact, there is a range of PPR values over which the best relative cost
reduction varies. This irregular behavior is due to the irregular search through
the solution space performed by the tabu search algorithm, which depends on
different heuristic parameters such as tabu list size, number of iterations
without improvement, and aspiration criteria. Metaheuristics such as tabu search
frequently depend on a fine-tuning stage, where parameters are tested and
calibrated. After this step the parameters remain unchanged, so for some test
cases they are not set to the values that would achieve the best results.

Finally, ANDES has proven to be a useful tool in the validation of
scheduling  algorithms.  The  direct  combination  of  both  scheduling  algorithms 
and  the  synthetic  execution  runtime  system  provided  an  environment  where 
response  time  measurements  could  be  quickly  obtained. 

5  Final  Remarks 

This paper presents an experimental validation of the makespan improvements of
two scheduling algorithms: a greedy construction algorithm and a tabu search
based algorithm. Synthetic parallel executions were performed given data on
task execution times, task precedence relations, and task scheduling. These
synthetic executions were performed on a real parallel machine (IBM SP). The
estimated and observed response time improvements are very similar, reflecting
the low impact of system overhead on makespan improvement estimation. This
indicates that the cost function used by static scheduling algorithms is reliable
and confirms that the better results of the tabu search metaheuristic applied to
scheduling problems are real.

Fig. 3. Relative cost reduction R versus PPR for two different sizes of Diamonds,
Divconq, FFT, Gauss, and MSS graphs (m corresponds to the number of processors to
which the tasks are scheduled).




makespan computation algorithm
begin
    Let s = (A_s(t_1), ..., A_s(t_n)) be a feasible solution for the scheduling problem, i.e.,
        for every k = 1, ..., n, A_s(t_k) = p_j for some p_j ∈ P
    clock ← 0
    state(p_j) ← free  ∀ p_j ∈ P
    start(t_k), finish(t_k) ← 0  ∀ t_k ∈ T
    while (∃ t_k ∈ T | state(t_k) ≠ executed) do
    begin
        for (each t_k ∈ T | state(t_k) = executable) do
            if (state(A_s(t_k)) = free) then
            begin
                state(t_k) ← executing
                state(A_s(t_k)) ← busy
                start(t_k) ← clock
                finish(t_k) ← start(t_k) + μ(t_k, A_s(t_k))
            end
        Let i be such that finish(t_i) = min {finish(t_k) : t_k ∈ T, state(t_k) = executing}
        clock ← finish(t_i)
        for (each t_k ∈ T | state(t_k) = executing and finish(t_k) = clock) do
        begin
            state(t_k) ← executed
            state(A_s(t_k)) ← free
        end
    end
    c(s) ← clock
end

Fig.  4.  Computation  of  the  makespan  of  a  given  schedule. 
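
The algorithm of Fig. 4 can be sketched in Python as follows. This is an illustrative transcription under stated assumptions, not the authors' implementation: the names `exec_time`, `preds` and `assign` are hypothetical, a task is taken to be "executable" once all its predecessors are executed (the figure leaves this rule implicit), and ties between executable tasks competing for the same processor are broken by the iteration order of `assign`.

```python
def makespan(exec_time, preds, assign):
    """Simulate a schedule and return its makespan c(s).

    exec_time[t][p]: execution time of task t on processor p
    preds[t]:        tasks that must be executed before t may start
    assign[t]:       processor A_s(t) to which task t is allocated
    """
    tasks = list(assign)
    state = {t: "waiting" for t in tasks}
    busy = {p: False for p in set(assign.values())}
    finish = {}
    clock = 0.0

    while any(state[t] != "executed" for t in tasks):
        # Start every executable task whose processor is currently free.
        for t in tasks:
            ready = state[t] == "waiting" and all(
                state[q] == "executed" for q in preds.get(t, ())
            )
            if ready and not busy[assign[t]]:
                state[t] = "executing"
                busy[assign[t]] = True
                finish[t] = clock + exec_time[t][assign[t]]

        running = [t for t in tasks if state[t] == "executing"]
        if not running:
            # Assumption: a feasible schedule never reaches this point.
            raise ValueError("no task can run: precedence relations are cyclic")

        # Advance the clock to the earliest finishing time and retire those tasks.
        clock = min(finish[t] for t in running)
        for t in running:
            if finish[t] == clock:
                state[t] = "executed"
                busy[assign[t]] = False

    return clock


# A small diamond-shaped task graph: a precedes b and c, which both precede d.
times = {
    "a": {"p0": 2, "p1": 2},
    "b": {"p0": 3, "p1": 3},
    "c": {"p0": 4, "p1": 4},
    "d": {"p0": 1, "p1": 1},
}
diamond = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
two_procs = makespan(times, diamond, {"a": "p0", "b": "p0", "c": "p1", "d": "p0"})
one_proc = makespan(times, diamond, {"a": "p0", "b": "p0", "c": "p0", "d": "p0"})
```

Because the simulation is event driven, the clock only jumps between task completion times, which is why the computation is deterministic for a given schedule.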

Fig. 5. Estimated (est) and observed (obs) relative cost reduction R versus PPR for
Diamonds graph (m corresponds to the number of processors on which the tasks are
scheduled).

Fig. 6. Estimated (est) and observed (obs) relative cost reduction R versus PPR
for FFT graph (m corresponds to the number of processors on which the tasks are
scheduled).

Fig. 8. Estimated (est) and observed (obs) relative cost reduction R versus PPR
for Gauss graph (m corresponds to the number of processors on which the tasks are
scheduled).

Abstract. A code for the simulation of the turbulent reactive flow with heat transfer in a
utility boiler has been parallelized using MPI. This paper reports a comparison of the
parallel efficiency of the code using the hybrid central differences/upwind and the MUSCL
schemes for the discretization of the convective terms of the governing equations. The
results were obtained using a Cray T3D and a number of processors in the range 1 - 128. It is
shown  that  higher  efficiencies  are  obtained  using  the  MUSCL  scheme  and  that  the  least 
efficient  tasks  are  the  solution  of  the  pressure  correction  equation  and  the  radiative  heat 
transfer  calculation. 

Keywords: Parallel Computing; Discretization Schemes; Computational Fluid Dynamics;
Radiation;  Boilers 


1  Introduction 

The  numerical  simulation  of  the  physical  phenomena  that  take  place  in  the 
combustion  chamber  of  a  utility  boiler  is  a  difficult  task  due  to  the  complexity  of 
those  phenomena  (turbulence,  combustion,  radiation)  and  to  the  range  of  geometrical 
length scales which spans four or five orders of magnitude [1]. As a consequence,
such  a  simulation  is  quite  demanding  as  far  as  the  computational  resources  are 
concerned.  Therefore,  parallel  computing  can  be  very  useful  in  this  field. 

The  mathematical  modelling  of  a  utility  boiler  is  often  based  on  the  numerical 
solution  of  the  equations  governing  conservation  of  mass,  momentum  and  energy,  and 
transport equations for scalar quantities describing turbulence and combustion. These
equations are solved in an Eulerian framework and their numerical discretization yields
convective terms which express the flux of a dependent variable across the faces of the
control volumes over which the discretization is carried out. Many discretization
schemes for the convective terms have been proposed over the years and this issue
has been one of the most important topics in computational fluid dynamics research.
The hybrid central differences/upwind scheme has been one of the most popular ones,
especially  in  incompressible  flows.  However,  it  reverts  to  the  first  order  upwind 
scheme  whenever  the  absolute  value  of  the  local  Peclet  number  is  higher  than  two, 
which  may  be  the  case  in  most  of  the  flow  regions.  This  yields  poor  accuracy  and 
numerical  diffusion  errors.  These  can  only  be  overcome  using  a  fine  grid  which 
enables  a  reduction  of  the  local  Peclet  number,  and  will  ultimately  revert  the  scheme 
to the second order accurate central differences scheme. However, this often requires
too fine a grid, and there is nowadays a general consensus that the hybrid scheme should
not be used (see, e.g., [2]). Moreover, some leading journals presently request that
solution  methods  must  be  at  least  second  order  accurate  in  space.  Alternative 
discretization  schemes,  such  as  the  skew  upwind,  second  order  upwind  and  QUICK,  are 
more  accurate  but  may  have  stability  and/or  boundedness  problems.  Remedies  to 
overcome  these  limitations  have  been  proposed  more  recently  and  there  are  presently 
several  schemes  available  which  are  stable,  bounded  and  at  least  second  order  accurate 
(see,  e.g.,  [3  -  9]). 

Several  high  resolution  schemes  have  been  incorporated  in  the  code  presented  in 
[10]  for  the  calculation  of  laminar  or  turbulent  incompressible  fluid  flows  in  two  or 
three-dimensional  geometries.  Several  modules  were  coupled  to  this  code  enabling  the 
modelling  of  combustion,  radiation  and  pollutants  formation.  In  this  work,  the  code 
was  applied  to  the  simulation  of  a  utility  boiler,  and  a  comparison  of  the  efficiency 
obtained  using  the  hybrid  and  the  MUSCL  ([11])  schemes  is  presented.  The 
mathematical  model  and  the  parallel  implementation  are  described  in  the  next  two 
sections.  Then,  the  results  are  presented  and  discussed,  and  the  conclusions  are 
summarized  in  the  last  section. 


2  The  Mathematical  Model 

2.1  Main  features  of  the  model 

The  mathematical  model  is  based  on  the  numerical  solution  of  the  density  weighted 
averaged  form  of  the  equations  governing  conservation  of  mass,  momentum  and 
energy,  and  transport  equations  for  scalar  quantities.  Only  a  brief  description  of  the 
reactive  fluid  flow  model  is  given  below.  Further  details  may  be  found  in  [12]. 

The  Reynolds  stresses  and  the  turbulent  scalar  fluxes  are  determined  by  means  of 
the k-ε eddy viscosity/diffusivity model which comprises transport equations for the
turbulent  kinetic  energy  and  its  dissipation  rate.  Standard  values  are  assigned  to  all  the 
constants  of  the  model. 

Combustion  modelling  is  based  on  the  conserved  scalar/probability  density 
function  approach.  A  chemical  equilibrium  code  is  used  to  obtain  the  relationship 
between  instantaneous  values  of  the  mixture  fraction  and  the  density  and  chemical 
species  concentrations.  The  calculation  of  the  mean  values  of  these  quantities  requires 
an  integration  of  the  instantaneous  values  weighted  by  the  assumed  probability  density 
function over the mixture fraction range. These calculations are performed a priori and
stored  in  tabular  form. 

The  discrete  ordinates  method  [13]  is  used  to  calculate  the  radiative  heat  transfer  in 
the combustion chamber. The S4 approximation, the level symmetric quadrature
satisfying  sequential  odd  moments  [14]  and  the  step  scheme  are  employed.  The  radiant 
superheaters  which  are  suspended  from  the  top  of  the  combustion  chamber  are 
simulated  as  baffles  without  thickness  as  reported  in  [15].  The  radiative  properties  of 
the  medium  are  calculated  using  the  weighted  sum  of  grey  gases  model. 

The  governing  equations  are  discretized  over  a  Cartesian,  non-staggered  grid  using  a 
finite volume/finite difference method. The convective terms are discretized using
either  the  hybrid  or  the  MUSCL  schemes.  The  solution  algorithm  is  based  on  the 
SIMPLE  method.  The  algebraic  sets  of  discretized  equations  are  solved  using  the 
Gauss-Seidel  line-by-line  iterative  procedure,  except  the  pressure  correction  equation 
which  is  solved  using  a  preconditioned  conjugate  gradient  method. 

2.2  Discretization  of  the  convective  terms 

The discretized equation for a dependent variable $\phi$ at grid node P may be written in the
following compact form:

$$a_P \phi_P = \sum_i a_i \phi_i + b \qquad (1)$$

where the coefficients $a_i$ denote combined convection/diffusion coefficients and b is a
source  term.  The  summation  runs  over  all  the  neighbours  of  grid  node  P  (east,  west, 
north,  south,  front  and  back).  Derivation  of  this  discretized  equation  may  be  found, 
e.g.,  in  [16].  If  the  convective  terms  are  computed  by  means  of  the  hybrid 
upwind/central  differences  method,  then  the  system  of  equations  (1)  is  diagonally 
dominant  and  can  be  solved  using  any  conventional  iterative  solution  technique.  If  a 
higher  order  discretization  scheme  is  used,  the  system  of  equations  may  still  have  a 
diagonally  dominant  matrix  of  coefficients  provided  that  the  terms  are  rearranged  using 
a  deferred  correction  technique  [17].  In  this  case,  equation  (1)  is  written  as: 

$$a_P^U \phi_P = \sum_i a_i^U \phi_i + b + \sum_j C_j \left( \phi_j^U - \phi_j \right) \qquad (2)$$

where the superscript U means that the upwind scheme is used to compute the
corresponding variable or coefficient, and $C_j$ is the convective flux at cell face j. The
last  term  on  the  right  hand  side  of  the  equation  is  the  contribution  to  the  source  term 
due  to  the  deferred  correction  procedure. 
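
The step from equation (1) to equation (2) can be written out explicitly. The following is a sketch under the assumption that the convective contribution of face j enters the balance as $C_j \phi_j$; the superscript HO, marking the face value given by the high order scheme (written simply as $\phi_j$ in equation (2)), is introduced here only for illustration.

```latex
% Split the high order face value into an upwind part and a correction:
C_j \phi_j^{HO} = C_j \phi_j^{U} + C_j \left( \phi_j^{HO} - \phi_j^{U} \right)

% The upwind part is treated implicitly and builds the coefficients a_P^U and a_i^U.
% The correction is evaluated with values from the previous iteration and moved
% to the right hand side, where it changes sign:
a_P^U \phi_P = \sum_i a_i^U \phi_i + b + \sum_j C_j \left( \phi_j^{U} - \phi_j^{HO} \right)
```

When the iterations converge, the implicit and explicit upwind contributions cancel, so the solution satisfies the high order discretization while the matrix keeps the diagonal dominance of the upwind scheme.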

The  high  order  schemes  were  incorporated  in  the  code  using  the  normalized  variable 
and  space  formulation  methodology  [18].  According  to  this  formulation,  denoting  by 
U,  C,  and  D  the  upstream,  central  and  downstream  grid  nodes  surrounding  the  control 
volume face f (see Figure 1), the following normalized variables are defined:

Figure 1 - Interpolation grid nodes involved in the calculation of $\phi_f$.

$$\tilde{\phi} = \frac{\phi - \phi_U}{\phi_D - \phi_U} \qquad (3)$$

$$\tilde{x} = \frac{x - x_U}{x_D - x_U} \qquad (4)$$

where  x  is  the  coordinate  along  the  direction  of  these  nodes.  The  upwind  scheme 
yields: 


$$\tilde{\phi}_f = \tilde{\phi}_C \qquad (5)$$

while the MUSCL scheme is given by:

$$\tilde{\phi}_f = \frac{2\tilde{x}_f - \tilde{x}_C}{\tilde{x}_C}\,\tilde{\phi}_C \quad \text{if } 0 < \tilde{\phi}_C < \tilde{x}_C/2 \qquad (6a)$$

$$\tilde{\phi}_f = \tilde{x}_f - \tilde{x}_C + \tilde{\phi}_C \quad \text{if } \tilde{x}_C/2 < \tilde{\phi}_C < 1 + \tilde{x}_C - \tilde{x}_f \qquad (6b)$$

$$\tilde{\phi}_f = 1 \quad \text{if } 1 + \tilde{x}_C - \tilde{x}_f < \tilde{\phi}_C < 1 \qquad (6c)$$

$$\tilde{\phi}_f = \tilde{\phi}_C \quad \text{elsewhere} \qquad (6d)$$

3 Parallel Implementation

The parallel implementation is based on a domain decomposition strategy and the
communications among the processors are accomplished using MPI. This standard is
now widely available and the code is therefore easily portable across hardware ranging
from workstation clusters, through shared memory modestly parallel servers, to
massively parallel systems. Within the domain decomposition approach the
computational domain is split up into non-overlapping subdomains, and each
subdomain is assigned to a processor. Each processor deals with a subdomain and
communicates and synchronizes its actions with those of other processors by
exchanging messages.

The calculation of the coefficients of the discretized equations in a control volume
requires the knowledge of the values of the dependent variables at one or two
neighbouring control volumes along each direction. Only one neighbour is involved if
the hybrid scheme is used, while two neighbours are involved if the MUSCL scheme
is employed. In the case of control volumes on the boundary of a subdomain, the
neighbours  lie  on  a  subdomain  assigned  to  a  different  processor.  In  distributed  memory 
computers  a  processor  has  only  in  its  memory  the  data  of  the  subdomain  assigned  to 
that  processor.  This  implies  that  it  must  exchange  data  with  the  neighbouring 
subdomains. 
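
To see why the MUSCL scheme needs two neighbouring control volumes, it helps to write the face interpolation of equations (3) to (6) as code: the face value depends on the upstream node U as well as on the nodes C and D on either side of the face. The sketch below is illustrative only; the function names are hypothetical, the fallback for $\phi_D = \phi_U$ is an assumption made to avoid a division by zero, and the interval boundaries are assigned to the neighbouring branch, which is harmless because the expressions agree there.

```python
def normalize(value, upstream, downstream):
    """Equations (3) and (4): scale a value by the range between nodes U and D."""
    return (value - upstream) / (downstream - upstream)


def muscl_face_value(phi_u, phi_c, phi_d, x_u, x_c, x_d, x_f):
    """Face value phi_f from the normalized MUSCL scheme, equations (6a)-(6d)."""
    if phi_d == phi_u:
        return phi_c  # assumption: fall back to the upwind value, equation (5)

    phi_c_n = normalize(phi_c, phi_u, phi_d)
    x_c_n = normalize(x_c, x_u, x_d)
    x_f_n = normalize(x_f, x_u, x_d)

    if 0.0 < phi_c_n < x_c_n / 2.0:
        phi_f_n = (2.0 * x_f_n - x_c_n) / x_c_n * phi_c_n  # (6a)
    elif x_c_n / 2.0 <= phi_c_n < 1.0 + x_c_n - x_f_n:
        phi_f_n = x_f_n - x_c_n + phi_c_n                  # (6b)
    elif 1.0 + x_c_n - x_f_n <= phi_c_n < 1.0:
        phi_f_n = 1.0                                      # (6c)
    else:
        phi_f_n = phi_c_n                                  # (6d), upwind

    # Undo the normalization of equation (3).
    return phi_u + phi_f_n * (phi_d - phi_u)


# Uniform grid with nodes at x = 0, 1, 2 and the face midway between C and D.
linear = muscl_face_value(0.0, 1.0, 2.0, 0.0, 1.0, 2.0, 1.5)
overshoot = muscl_face_value(0.0, 3.0, 2.0, 0.0, 1.0, 2.0, 1.5)
```

For a linear profile the scheme returns the exact face value, while for a non-monotonic profile it reverts to the upwind value, which is what keeps the scheme bounded.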

Data  transfer  between  neighbouring  subdomains  is  simplified  by  the  use  of  a  buffer 
of  halo  points  around  the  rectangular  subdomain  assigned  to  every  processor.  Hence, 
two  planes  of  auxiliary  points  are  added  to  each  subdomain  boundary,  which  store  data 
calculated in neighbouring subdomains. These data are periodically exchanged between
neighbouring  subdomains  to  ensure  the  correct  coupling  of  the  local  solutions  into  the 
global  solution.  This  halo  data  transfer  between  neighbouring  processors  is  achieved 
by  a  pair-wise  exchange  of  data.  This  transfer  proceeds  in  parallel  and  it  will  be  referred 
to  as  local  communications.  Local  communication  of  a  dependent  variable  governed  by 
a  conservation  or  transport  equation  is  performed  just  before  the  end  of  each  sweep 
(inner  iteration)  of  the  Gauss-Seidel  procedure  for  that  variable.  Local  communication 
of  the  mean  temperature,  density,  specific  heat  and  effective  viscosity  is  performed 
after  the  update  of  these  quantities,  i.e.,  once  per  outer  iteration.  Besides  the  local 
communications,  the  processors  need  to  communicate  global  data  such  as  the  values  of 
the  residuals  which  need  to  be  accumulated,  or  maximum  or  minimum  values 
determined, or values broadcast. These data exchanges are referred to as global
communications  and  are  available  in  standard  message  passing  interfaces. 
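
The halo mechanism can be illustrated with a one-dimensional sketch in plain Python. It is not the authors' MPI code: the list copies below merely stand in for the pair-wise messages between neighbouring processors, and the names `HALO`, `split_with_halo` and `exchange_halos` are hypothetical.

```python
HALO = 2  # two planes of halo points per boundary, as stated in the text


def split_with_halo(field, n_sub):
    """Split a 1D field into n_sub equal blocks, each padded with HALO cells per side."""
    size = len(field) // n_sub
    blocks = []
    for rank in range(n_sub):
        interior = field[rank * size:(rank + 1) * size]
        blocks.append([None] * HALO + list(interior) + [None] * HALO)
    return blocks


def exchange_halos(blocks):
    """Pair-wise exchange between each pair of neighbouring subdomains."""
    for rank in range(len(blocks) - 1):
        left, right = blocks[rank], blocks[rank + 1]
        # Last HALO interior cells of the left block fill the right block's left halo.
        right[:HALO] = left[-2 * HALO:-HALO]
        # First HALO interior cells of the right block fill the left block's right halo.
        left[-HALO:] = right[HALO:2 * HALO]


blocks = split_with_halo(list(range(8)), n_sub=2)
exchange_halos(blocks)
```

After the exchange each block holds copies of the two nearest interior values of its neighbour, which is the data a two-neighbour stencil needs at the subdomain boundary. Halo cells on the outer boundary of the domain stay empty in this sketch, since boundary conditions are outside its scope.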

While  the  parallelization  of  the  fluid  flow  equations  solver  has  been  widely 
addressed  in  the  literature,  the  parallelization  of  the  radiation  model  has  received  little 
attention.  The  method  employed  here  is  described  in  detail  in  [19-20]  and  uses  the 
spatial  domain  decomposition  method  for  the  parallelization  of  the  discrete  ordinates 
method.  It  has  been  found  that  this  method  is  not  as  efficient  as  the  angular 
decomposition  method,  since  the  convergence  rate  of  the  radiative  calculations  is 
adversely  influenced  by  the  decomposition  of  the  domain,  dropping  fast  as  the  number 
of  processors  increases.  However,  the  compatibility  with  the  domain  decomposition 
technique  used  in  parallel  computational  fluid  dynamics  (CFD)  codes  favours  the  use 
of  the  spatial  domain  decomposition  method  for  the  radiation  in  the  case  of  coupled 
fluid  flow/heat  transfer  problems. 


4  Results  and  Discussion 

The  code  was  applied  to  the  simulation  of  the  physical  phenomena  taking  place  in  the 
combustion  chamber  of  a  power  station  boiler  of  the  Portuguese  Electricity  Utility.  It 
is  a  natural  circulation  drum  fuel-oil  fired  boiler  with  a  pressurized  combustion 
chamber,  parallel  passages  by  the  convection  chamber  and  preheating.  The  boiler  is 
fired  from  three  levels  of  four  burners  each,  placed  on  the  front  wall.  Vaporization  of 
the  fuel  is  assumed  to  occur  instantaneously.  At  maximum  capacity  (771  ton/h  at  167 
bar  and  545°C)  the  output  power  is  250  MWe.  This  boiler  has  been  extensively 
investigated in the past, both experimentally and numerically (see, e.g., [1, 21-23]),
and therefore no predictions are shown in this paper, which concentrates only on the
parallel performance of the code.

The  calculations  were  performed  using  the  Cray  T3D  of  the  University  of 
Edinburgh in the U.K. It comprises 256 nodes, each with two processing units. Each
processing  element  consists  of  a  DEC  Alpha  21064  processor  running  at  150MHz  and 
delivering  150  64-bit  Mflop/s.  The  peak  performance  of  the  512  processing  elements 
is  76.8  Gflop/s. 

Jobs running on the computer used in the present calculations and using fewer than
64 processors are restricted to a maximum of 30 minutes. Therefore, to allow a
comparison between runs with different numbers of processors, the initial calculations,
summarized in tables 1 and 2, were carried out for a fixed number of iterations (30 for
the  MUSCL  discretization  scheme  and  80  for  the  hybrid  scheme).  They  were  obtained 
using  a  grid  with  64,000  grid  nodes  (20x40x80).  Radiation  is  not  accounted  for  in  this 
case.  For  a  given  number  of  processors  different  partitions  of  the  computational 
domain  were  tried,  yielding  slightly  different  results.  The  influence  of  the  partition  on 
the  attained  efficiency  is  discussed  in  [24].  Only  the  results  for  the  best  partitions,  as 
far  as  the  efficiency  is  concerned,  are  shown  in  tables  1  and  2. 

The parallel performance of the code is examined by means of the speedup, S,
defined as the ratio of the execution time of the parallel code on one processor to the
execution time on np processors (t_total), and the efficiency, ε, defined as the ratio of
the speedup to the number of processors. The results obtained show that the highest
speedups are obtained when the MUSCL discretization scheme is employed. For
example, when 128 processors are used, speedups of 95.1 and 77.4 are achieved using
the MUSCL and the hybrid discretization schemes, respectively. As the number of
processors increases, so does the speedup of the MUSCL calculations compared to the
hybrid calculations. For np=2 we have S(MUSCL)/S(hybrid)=1.04, while for np=128
that ratio is 1.23. There are two opposite trends responsible for this behaviour. In fact,
there  is  more  data  to  communicate  among  the  processors  when  the  MUSCL  scheme  is 
used,  because  there  are  two  planes  of  grid  nodes  in  the  halo  region  compared  to  only 
one  when  the  hybrid  scheme  is  employed.  However,  the  calculation  time  is 
significantly  higher  for  the  MUSCL  scheme  since  the  computation  of  the  coefficients 
of  the  discretized  equations  is  more  involved.  Overall,  the  ratio  of  the  communications 
to  the  total  time  is  larger  for  the  hybrid  scheme,  yielding  smaller  speedups. 
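
The definitions of speedup and efficiency can be checked against the execution times listed in tables 1 and 2. The short sketch below uses hypothetical function names. Its results match the tabulated values only to within rounding, because the published times are themselves rounded.

```python
def speedup(t_one, t_np):
    """Speedup S: time on one processor divided by time on np processors."""
    return t_one / t_np


def efficiency(t_one, t_np, n_proc):
    """Efficiency in percent: speedup divided by the number of processors."""
    return 100.0 * speedup(t_one, t_np) / n_proc


# Execution times in seconds for np = 1 and np = 128, taken from tables 1 and 2.
hybrid = {1: 1666.6, 128: 21.5}
muscl = {1: 1692.0, 128: 17.8}

s_hybrid = speedup(hybrid[1], hybrid[128])
s_muscl = speedup(muscl[1], muscl[128])
ratio = s_muscl / s_hybrid
```

The ratio of the two speedups at 128 processors is close to the value of 1.23 quoted in the text.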

The calculations were divided into five tasks, for analysis purposes, and their
partial efficiency is shown in tables 1 and 2. These tasks are: i) the solution of the
momentum equations, including the calculation of the convective, diffusive and source
terms, the incorporation of the boundary conditions, the calculation of the residuals,
the solution of the algebraic sets of equations and the associated communications
among processors; ii) the solution of the pressure correction equation and the
correction of the velocity and pressure fields according to the SIMPLE algorithm; iii)
the solution of the transport equations for the turbulent kinetic energy and its
dissipation rate; iv) the solution of other scalar transport equations, namely
the enthalpy, mixture fraction and mixture fraction variance equations; v) the
calculation of the mean properties, namely the turbulent viscosity and the mean values
of density, temperature and chemical species concentrations. The efficiencies of these
five tasks are referred to as ε_vel, ε_p, ε_k,ε, ε_scalars and ε_prop, respectively.

Table 1. Parallel performance of the code for the first 80 iterations using the hybrid
discretization scheme.

np             1       2       4       8       16      32      64      128
Partition      1x1x1   1x2x1   1x4x1   1x8x1   2x4x2   2x4x4   2x4x8   2x8x8
t_total (s)    1666.6  850.9   439.5   231.5   123.5   67.8    36.8    21.5
S              1       1.96    3.79    7.20    13.5    24.6    45.2    77.4
ε (%)          -       97.9    94.8    90.0    84.3    76.8    70.7    60.5
ε_vel (%)      -       99.5    97.9    94.9    89.5    84.3    79.6    71.7
ε_p (%)        -       97.9    91.9    83.2    73.5    63.9    56.0    39.2
ε_k,ε (%)      -       96.7    92.5    86.8    82.8    75.6    69.4    60.9
ε_scalars (%)  -       97.1    94.3    88.9    84.7    77.2    70.5    61.5
ε_prop (%)     -       97.6    94.9    93.6    87.1    76.5    73.5    71.7

Table 2. Parallel performance of the code for the first 30 iterations using the
MUSCL discretization scheme.

np             1       2       4       8       16      32      64      128
Partition      1x1x1   1x2x1   1x4x1   1x8x1   2x4x2   2x4x4   2x4x8   2x8x8
t_total (s)    1692.0  832.2   420.8   215.9   116.1   61.0    31.9    17.8
S              1       2.03    4.02    7.84    14.6    27.7    53.1    95.1
ε (%)          -       101.7   100.5   98.0    91.1    86.7    82.9    74.3
ε_vel (%)      -       97.7    97.8    97.1    91.3    88.7    86.7    82.1
ε_p (%)        -       95.3    84.1    70.7    59.2    49.9    41.9    28.6
ε_k,ε (%)      -       103.0   102.2   100.0   93.2    89.4    86.0    78.5
ε_scalars (%)  -       106.8   106.4   104.5   98.2    94.2    90.9    84.5
ε_prop (%)     -       97.9    95.7    93.6    86.8    80.6    76.9    71.8

It  can  be  seen  that  the  efficiency  of  the  pressure  correction  task  is  the  lowest  one, 
and  decreases  much  faster  than  the  efficiencies  of  the  other  tasks  when  the  number  of 
processors  increases.  The  reason  for  this  behaviour  is  that  the  amount  of  data  to  be 
communicated  associated  with  this  task  is  quite  large,  as  discussed  in  [24-25]. 
Therefore,  the  corresponding  efficiency  is  strongly  affected  by  the  number  of 
processors.  The  computational  load  of  this  task  is  independent  of  the  discretization 
scheme  of  the  convective  terms  because  the  convective  fluxes  across  the  faces  of  the 
control  volumes  are  determined  by  means  of  the  interpolation  procedure  of  Rhie  and 
Chow  [26]  and  there  is  no  transport  variable  to  be  computed  at  those  faces  as  in  the 
other  transport  equations.  However,  the  communication  time  is  larger  for  the  MUSCL 
scheme,  due  to  the  larger  halo  region.  Therefore,  the  task  efficiency  is  smaller  for  the 
MUSCL  scheme,  in  contrast  to  the  overall  efficiency. 

The efficiency of the properties calculation is slightly higher for the hybrid scheme
if np=16, and equal or slightly lower in the other cases, but it does not differ much
from one scheme to the other. This is a little more difficult to interpret, since the
computational load of this task is also independent of the discretization scheme and the
communication time is larger for the MUSCL scheme. Hence, a smaller efficiency would
be expected in the case of the MUSCL scheme, exactly as observed for the
pressure task. But the results do not confirm this expectation. It is believed that the
reason for this behaviour is the following.

If  the  turbulent  fluctuations  are  small,  the  mean  values  of  the  properties  (e.g., 
density  and  temperature)  are  directly  obtained  from  the  mean  mixture  fraction, 
neglecting  those  fluctuations.  If  they  are  significant,  typically  when  the  mixture 
fraction  variance  exceeds  10'^,  then  the  mean  values  are  obtained  from  interpolation  of 
the  data  stored  in  tabular  form.  This  data  is  obtained  a  priori  accounting  for  the 
turbulent  fluctuations  for  a  range  of  mixture  fraction  and  mixture  fraction  variance 
values.  Although  the  interpolation  of  the  stored  data  is  relatively  fast,  it  is  still  more 
time  consuming  than  the  determination  of  the  properties  in  the  case  of  negligible 
fluctuations.  Therefore,  when  the  number  of  grid  nodes  with  significant  turbulent 
fluctuations  increases,  the  computational  load  increases  too.  Since  the  calculations 
start  from  a  mixture  fraction  variance  field  uniform  and  equal  to  zero,  a  few  iterations 
are  needed  to  increase  the  mixture  fraction  variance  values  above  the  limit  of  10’^.  The 
results  given  in  tables  1  and  2  were  obtained  using  a  different  number  of  iterations,  30 
for the MUSCL scheme and 80 for the hybrid scheme. So, it is expected that in the
former case the role of the turbulent fluctuations is still limited compared to the latter
case. This means that the computational load per iteration will actually be higher for
the  hybrid  scheme,  rather  than  identical  in  both  cases  as  initially  assumed.  This  would 
explain  the  similar  task  efficiency  observed  for  the  two  schemes. 

The  three  remaining  tasks,  i),  iii)  and  iv),  exhibit  a  similar  behaviour,  the 
efficiency  being  higher  for  the  calculations  using  the  MUSCL  scheme.  This  is 
explained  exactly  by  the  same  reasons  given  for  the  overall  efficiency.  In  both  cases, 
the efficiency of these tasks is higher than the overall efficiency, compensating for the
smaller  efficiency  of  the  pressure  task.  For  a  small  number  of  processors  the  efficiency 
of  these  tasks  slightly  exceeds  100%.  This  has  also  been  found  by  other  researchers 
and  is  certainly  due  to  a  better  use  of  cache  memory. 

Tables 3 and 4 summarize the results obtained for a complete run, i.e., for a
converged solution, using 32, 64 and 128 processors. Convergence is faster if the
hybrid scheme is employed, as expected. Regardless of the discretization scheme, there
is a small influence of the number of processors on the convergence rate, and although
this rate tends to decrease for a large number of processors, it does not change
monotonically. The complex interaction between different phenomena and the
non-linearity of the governing equations may be responsible for the non-monotonic
behaviour, which has also been found in other studies. Since the smallest number of
processors used in these calculations was 32, the efficiency and the speedup were
computed taking the run with 32 processors as a reference. This means that the values
presented in tables 3 and 4, denoted by the subscript rel, are relative efficiencies and
speedups. The relative speedup is higher when the MUSCL scheme is employed, in
agreement with the trend observed in tables 1 and 2 for the first few iterations.

Table 3. Parallel performance of the code using the hybrid discretization scheme

np                   32      64      128
Partition            2x4x4   2x4x8   2x8x8
n_iter               1758    1749    1808
t_total (s)          1982    1107    701
S_rel                1       1.79    2.83
ε_rel (%)            -       89.5    70.7
ε_rel,vel (%)        -       95.0    83.0
ε_rel,p (%)          -       86.3    58.9
ε_rel,k,ε (%)        -       91.6    78.1
ε_rel,scalars (%)    -       90.9    77.5
ε_rel,prop (%)       -       99.1    92.6
ε_rel,radiation (%)  -       82.7    58.1

Table 4. Parallel performance of the code using the MUSCL scheme

np                   32      64      128
Partition            2x4x4   2x4x8   2x8x8
n_iter               3244    3215    3257
t_total (s)          7530    4119    2524
S_rel                1       1.83    2.98
ε_rel (%)            -       91.4    74.6
ε_rel,vel (%)        -       98.1    90.7
ε_rel,p (%)          -       82.8    54.5
ε_rel,k,ε (%)        -       96.5    86.4
ε_rel,scalars (%)    -       97.3    88.3
ε_rel,prop (%)       -       96.1    87.9
ε_rel,radiation (%)  -       75.6    49.0
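
The relative values can be reproduced from the total times of the hybrid runs in table 3. The worked example below assumes that the relative efficiency is the relative speedup divided by the ratio of processor counts, np/32; this reading is not stated explicitly in the text, but it reproduces the tabulated values.

```latex
% Relative speedup and efficiency, with the 32 processor run as reference:
S_{rel}(n_p) = \frac{t_{total}(32)}{t_{total}(n_p)}, \qquad
\varepsilon_{rel}(n_p) = \frac{S_{rel}(n_p)}{n_p / 32}

% Hybrid scheme, 64 processors:
S_{rel}(64) = \frac{1982}{1107} \approx 1.79, \qquad
\varepsilon_{rel}(64) \approx \frac{1.79}{2} \approx 89.5\%

% Hybrid scheme, 128 processors:
S_{rel}(128) = \frac{1982}{701} \approx 2.83, \qquad
\varepsilon_{rel}(128) \approx \frac{2.83}{4} \approx 70.7\%
```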

There  are  two  tasks  that  exhibit  a  much  smaller  efficiency  than  the  others:  the 
solution  of  the  pressure  correction  equation  and  the  calculation  of  the  radiative  heat 
transfer.  The  low  efficiency  of  the  radiative  heat  transfer  calculations  is  due  to  the 
decrease  of  the  convergence  rate  with  the  increase  of  the  number  of  processors  [19-20]. 
The  radiation  subroutine  is  called  with  a  certain  frequency,  typically  every  10  iterations 
of  the  main  loop  of  the  flow  solver  (SIMPLE  algorithm).  The  radiative  transfer 
equation  is  solved  iteratively  and  a  maximum  number  of  iterations,  10  in  the  present 
work,  is  allowed.  If  the  number  of  processors  is  small,  convergence  is  achieved  in  a 
small number of iterations. But when the number of processors is large, the limit of
10 iterations is reached, and a number of iterations smaller than this maximum is
sufficient for convergence only when a quasi-converged solution has been obtained.
Both  the  pressure  and  the  radiation  tasks  have  a  lower  partial  efficiency  if  the  MUSCL 
scheme  is  used.  In  fact,  the  computational  effort  of  these  tasks  is  independent  of  the 
discretization  scheme,  and  the  communication  time  is  higher  for  the  MUSCL  scheme. 
The  same  is  true,  at  least  after  the  first  few  iterations,  for  the  properties  task.  The 
other  tasks  (momentum,  turbulent  quantities  and  scalars)  involve  the  solution  of 
transport  equations  and  their  computational  load  strongly  depends  on  how  the 
convective  terms  are  discretized.  Hence,  their  efficiencies  are  higher  than  the  overall 
efficiency,  the  highest  efficiencies  being  achieved  for  the  MUSCL  scheme. 


5  Conclusions 

The combustion chamber of a power station boiler was simulated using a Cray T3D
and a number of processors ranging from 1 to 128. The convective terms of the
governing equations were discretized using either the hybrid central differences/upwind
scheme or the MUSCL scheme, and a comparison of the parallel efficiencies attained in
both cases was presented. The MUSCL scheme is more computationally demanding, and
requires more data to be exchanged among the processors, but it yields higher speedups
than the hybrid scheme. An examination of the computational load of the different tasks
of the code shows that two of them control the speedup. These are the solution of
the pressure correction equation, which requires substantial communication among
processors, and the calculation of the radiative heat transfer, whose convergence rate is
strongly dependent on the number of processors. The efficiency of these tasks, as well
as the efficiency of the properties calculation task, is higher for the hybrid than for the
MUSCL scheme. Conversely, the efficiency of the tasks that involve the
solution of transport equations is higher for the MUSCL than for the hybrid scheme,
and it is also higher than the overall efficiency.
Abstract. Some aspects of the parallel/vector implementation of an
adaptive edge-based high-resolution scheme, for the solution of the compressible
Euler equations on unstructured grids, on current shared memory
supercomputers are presented. We address the use of an alternative data
structure, known as superedge, which groups together several edges and
which attempts to find a good balance between floating point (flop) and
indirect addressing (i/a) operations. It is shown that the practical usefulness
of the flow solver has been dramatically improved by efficient
implementation on high performance computer configurations and also
that switching from an edge-based to a superedge data structure is not
worthwhile for codes which already have a sufficiently high ratio between
(flop) and (i/a).


1  Introduction 

In recent years, there has been a significant level of research into the application
of unstructured mesh methods to the simulation of fluid dynamic problems.
For unstructured triangular and tetrahedral meshes, major progress has been
made in the areas of automatic mesh generation and flow solver accuracy [1,
2]. However, the storage of mesh connectivity information increases the use of
computer memory and of the indirect addressing needed to retrieve the local
information required by the flow solver algorithm. To reduce (i/a) and memory
requirements, finite element schemes based on edge-based data structures have
been introduced by Morgan et al. [3], inspired by the finite volume schemes (see
Barth in [4]). The use of an edge-based data structure also enables a
straightforward implementation of upwind-biased schemes in the context of finite
element methods.

In this paper, an upwind-biased high-resolution flux-split algorithm is used
as the general approach for constructing high-resolution schemes. An adaptive
edge-based Galerkin finite element formulation is used as the building block for
the multidimensional generalization of the essentially one-dimensional upwinding
concepts. The resultant flow solver is used for the solution of the compressible Euler
equations on unstructured grids. A simple explicit time integration is adopted
to drive the solution towards a steady state. For a detailed description of the
flow solver algorithm and related issues see Lyra [2].

The number of repeated evaluations of right-hand sides (RHS) or residuals is
quite large and time consuming with explicit upwind-like schemes. The use of
efficient data structures, searching algorithms and implementations is fundamental.
The main steps adopted for a parallel/vector implementation on the CRAY
J90 of the flow code using either an edge-based or the alternative superedge
data structure are described. The techniques used are simple and try to reduce the
investment in man-hours when re-writing the code for the use of alternative data
structures [5]. These issues will be discussed in detail. Finally, we will present a
comparative performance study, on different computer platforms, of edge-based
and triangular superedge schemes for the solution of a typical two-dimensional
model problem of a supersonic flow past a circular cylinder.


2  Numerical  Solution  Algorithm 

2.1  Edge-Based  Finite  Element  Formulation 

Assuming that the spatial 2-D domain Ω is discretized into an unstructured
assembly of linear triangular elements, after employing the Galerkin finite element
approximation, the resultant discrete formulation can be conveniently expressed
as


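$$\left[\mathbf{M}\,\frac{d\mathbf{U}}{dt}\right]_I = -\sum_{s=1}^{m_I} \frac{C^j_{II_s}}{2}\left(\mathbf{F}^j_I + \mathbf{F}^j_{I_s}\right) + \left\langle \sum_{f=1}^{2} D_f\left(4\bar{\mathbf{F}}^n_I + 2\bar{\mathbf{F}}^n_{J_f} + \mathbf{F}^n_I - \mathbf{F}^n_{J_f}\right)\right\rangle_I \qquad (1)$$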


where an edge-based data structure has been employed instead of the conventional
finite element data structure which is based on the connectivity of the
elements. This allows a direct implementation of different types of standard 1-D
upwind or centered shock-capturing methods within an unstructured grid context
[2, 3].

The extension of the one-dimensional upwinding concepts to two-dimensional
generic discretisations consists of the use of an edge-based Galerkin finite element
formulation together with different strategies to build a local 1-D-like
"structured" stencil by means of an interpolation reconstruction technique [6].
It is well known that the use of this data structure has additional beneficial
effects, in terms of both processing time and memory requirements [3]. These
effects will be of particular importance when these methods are extended to
the solution of large-scale three-dimensional problems.

In (1), $m_I$ is the number of sides connected to node $I$, the second term on
the right-hand side denotes a correction required for the nodes $I$ which lie on
the boundary of the computational domain, and

$$\mathbf{C}_{II_s} = \left(C^1_{II_s},\, C^2_{II_s}\right), \qquad L_{II_s} = \left|\mathbf{C}_{II_s}\right| .$$

Here, $C^j_{II_s}$ and $D_f$ are coefficients which depend on the finite element trial
functions. From the antisymmetry of the edge weights, the numerical discretization
scheme can be immediately observed to possess a conservation property, in the
sense that the sum of the contributions made by any interior edge is zero [3].
Finally, M represents the consistent finite element mass matrix.
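The conservation property can be illustrated with a few lines of code. The sketch below is not the flow solver: the small edge list and the random edge fluxes are hypothetical stand-ins. It only shows that, when each interior edge sends equal and opposite contributions to its two nodes, the nodal residuals sum to zero.

```python
import random

random.seed(0)  # arbitrary data, for illustration only

n_nodes = 5
# Hypothetical interior edges (I, I_s) of a small mesh.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 2)]
# One value per edge, standing in for the weighted edge flux.
edge_flux = [random.uniform(-1.0, 1.0) for _ in edges]

rhs = [0.0] * n_nodes
for (i, j), f in zip(edges, edge_flux):
    rhs[i] -= f  # contribution of the edge to node I
    rhs[j] += f  # equal and opposite contribution to node I_s

total = sum(rhs)
print(abs(total) < 1e-12)
```

Whatever numerical flux is used on an edge, nothing is created or destroyed in the interior of the domain as long as the two nodes receive the same value with opposite signs.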

Practical algorithms for the Euler equations can be produced by evaluating
a convenient numerical flux, in the direction of the weighting coefficient
$\mathbf{C}_{II_s}$, in place of the actual flux $F_{II_s}$. In this work the flux difference scheme
proposed by Roe [7] is adopted as the lower-order stable formulation.

2.2  High-Resolution  Scheme 

The MUSCL scheme (Monotonic Upstream-centered Schemes for Conservation
Laws), proposed by Van Leer [8], is used to build high-resolution schemes. The
higher-order slope-limited MUSCL scheme can be defined through the numerical flux

$$F_{II_s} = \frac{1}{2}\left\{\left(F^j_L + F^j_R\right) S^j_{II_s} - \left|A(U_L, U_R)\right|\left[U_R - U_L\right]\right\} \qquad (3)$$

where a piecewise linear reconstruction, with the introduction of non-linear limiters,
is used to compute the limited interface values $U_L$ and $U_R$.

The key point of this class of higher-order procedures is the extension of
the support of the lower-order stencil and the guarantee of the monotonicity
property of the lower-order scheme. For a detailed description of the different
formulations, and for a discussion of several possibilities for the construction of the
extended stencil, the choice of the limiter functions and other related issues,
see Lyra [2] and Lyra et al. [6]. Several different limiter functions, which may be
computed using an upwind stencil or a symmetric stencil, are available. Here, the
primitive variables are chosen to be limited for reasons of economy. Numerical
evidence supports this choice, since no oscillation, or very little oscillation, is
present in the solution. The limiter function given in [9] was used in the analysis
presented in this paper.
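The idea of a limited reconstruction followed by an upwind flux evaluation can be sketched for a scalar quantity on a uniform 1-D stencil. This is only an illustration under stated assumptions: the minmod limiter is used here for simplicity and is not the limiter of [9], the flux is that of linear advection with constant speed `a` rather than Roe's flux for the Euler equations, and all function names are hypothetical.

```python
def minmod(a, b):
    """Minmod limiter: the smaller slope when the signs agree, zero otherwise."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def muscl_interface_values(u_im1, u_i, u_ip1, u_ip2):
    """Limited left and right states at the interface between cells i and i+1."""
    u_left = u_i + 0.5 * minmod(u_i - u_im1, u_ip1 - u_i)
    u_right = u_ip1 - 0.5 * minmod(u_ip1 - u_i, u_ip2 - u_ip1)
    return u_left, u_right


def upwind_flux(u_left, u_right, a):
    """Central flux minus |a| times the jump, for the linear flux f(u) = a*u."""
    return 0.5 * (a * u_left + a * u_right) - 0.5 * abs(a) * (u_right - u_left)


# Smooth (linear) data: both states meet at the midpoint value.
smooth = muscl_interface_values(0.0, 1.0, 2.0, 3.0)
# A discontinuity: the limiter switches the slopes off and no new extrema appear.
jump = muscl_interface_values(0.0, 0.0, 1.0, 1.0)
print(smooth, jump, upwind_flux(*jump, a=1.0))
```

In smooth regions the reconstruction recovers second-order interface values, while across the jump it falls back to the first-order upwind states, which is the monotonicity guarantee mentioned above.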

Equation (1) represents the time evolution of the unknown vector $U_I(t)$ at
node $I$ of the mesh. A practical solution algorithm is then produced by further
discretizing the time dimension, with a simple explicit time stepping scheme. The
consistent finite element "mass" matrix M is replaced by the standard lumped
(diagonal) mass matrix $M_L$. This enables a truly explicit time integration and
does not alter the final steady state solution, which is of primary concern here.
For the steady-state analysis studied, local time stepping was used to accelerate
the convergence rate towards steady state.
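A minimal sketch of such an explicit update is given below, assuming a forward Euler step in which each node uses its own lumped mass entry and its own local time step. The function and argument names are illustrative, and the rule used to choose the local time steps is not shown.

```python
def explicit_step(u, rhs, lumped_mass, local_dt):
    """One explicit update u_I <- u_I + dt_I * RHS_I / M_L,I for every node I."""
    return [u_i + dt_i * r_i / m_i
            for u_i, r_i, m_i, dt_i in zip(u, rhs, lumped_mass, local_dt)]


# A node with residual 2, lumped mass 4 and local time step 0.5.
print(explicit_step([1.0], [2.0], [4.0], [0.5]))

# At steady state the residual vanishes, so neither the lumping nor the
# local time steps change the solution.
print(explicit_step([1.0, 2.0], [0.0, 0.0], [0.5, 2.0], [0.1, 0.3]))
```

Because the mass matrix is diagonal, no linear system has to be solved, and because a vanishing residual leaves the solution unchanged, the steady state is independent of the time steps chosen.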


3  The  Computational  Implementation  and  Other  Issues 

3.1  Mesh  Generation  and  Adaption 

The unstructured triangulations adopted for the 2-D computations were generated
with the advancing front technique. An adaptive mesh enrichment procedure
for steady state solutions was used to improve the accuracy of the inviscid
computations analysed. The error estimates are based upon concepts from
interpolation theory and are used to control the adaptivity automatically. Further
details about the mesh generator, the error analysis involved in the procedure and
the adaptive procedure itself can be found in Morgan et al. [3].

3.2  Data  Structure 

The standard finite element data structure consists of the physical coordinates
simply listed by node numbers, a list of the connectivity of each element and
a list of boundary edge connectivities. With this geometrical and topological
data, the integral terms that appear in the finite element formulation of the
problem can be calculated with a loop over the elements and a loop over the
boundary edges, with the contributions to the nodes being accumulated during
the process.

As an alternative to the element-based data structure, we can represent an
unstructured grid in terms of an edge-based data structure. The physical coordinates
are simply listed by node numbers and a list of boundary edge connectivities
is adopted, but now the topology inside the domain is characterized through
the edges and their connectivities. A significant reduction in gather/scatter costs
and memory requirements can be realized by going from an element-based to an
edge-based data structure (see Luo et al. [10], Morgan et al. [3] and Martins et
al. [11]), this being more pronounced in three-dimensional simulations.

The use of an edge-based data structure in CFD, as opposed to the element-based
data structure, has the following well established advantages: a) better
performance of the numerical procedures in terms of computational efficiency
(the computational effort necessary to evaluate and assemble edge contributions
to the nodes is significantly reduced when compared with the element counterpart
and the indirect addressing is also reduced); b) straightforward
construction of different flow solver algorithms, by generalizing one-dimensional
algorithms (from centered to upwinding approaches); c) easy implementation of
numerical schemes for both 2D and 3D applications, also due to the 1-D-like data
structure; d) satisfaction of the discrete conservation property.

Löhner [5] pointed out that the expected reduction in the total CPU cost can
be partially lost if the indirect addressing overhead accounts for the major part
of the total CPU requirements. With the design criterion of operating as much as
possible on gathered data, Löhner presents several alternative data structures.
According to Löhner, the adoption of the stars data structure leads to a very small
i/a reduction, with the penalty of major code rewriting and
with possible drawbacks in terms of large bandwidths and cache misses. The
use of the chains data structure can lead to a good i/a reduction but would prevent
implementation on vector processors.

Analysing the main characteristics of the alternative data structures
suggested by Löhner [5], the superedge alternative seems to represent the best
compromise, being suitable for scalar or vector implementations. Martins et
al. [11] have demonstrated that migrating from an edge-based data structure
to a superedge data structure would lead to a gain of at least 20% in CPU usage
for three-dimensional potential flows. We will address the use of superedges for
the explicit solution of the Euler equations using high-resolution schemes.

3.3  Pre-Processing 

The  generated  grids,  either  initial  or  refined,  are  provided  in  the  conventional 
element-based  data  structure  format.  Thus,  a  pre-processing  of  the  grid  must  be 
undertaken  before  it  can  be  used  with  an  edge-based  flow  analysis  algorithm. 
After  the  preprocessor  stage  the  element-based  data  structure  can  be  discarded. 
The pre-processor stage consists basically of the following steps:

1. Build the arrays with the grid and boundary topology, which are lists of
edges and boundary faces with their respective connectivities;

2. When superedges are used, group edges into superedges and organize the
remaining edges;

3. Compute and store the edge and boundary face weighting coefficients [2];

4. Find and store the information necessary for the use of the dummy
nodes;

5. Employ a colouring algorithm [12] to group the edges, superedges and boundary
faces in such a way that no repetition in the node numbering occurs
amongst items of the same group.

Remark 1. The information required to describe an unstructured mesh is minimal
when using an edge-based data structure. A hash table searching technique
is used to extract the edges from the original data structure.

Remark 2. As the number of edges in a group increases, longer and more
complicated loops have to be implemented. Therefore, in the words of Löhner
[5], "a balance has to be struck between efficiency and code simplicity, clarity
and maintenance". A triangle superedge was adopted so that, besides the above
reasons, a very simple grouping algorithm can be devised, i.e. grouping the three
sides of a triangle as a superedge in such a way that as many three-edge groups
as possible are formed and then organizing the remaining edges [11].

Remark 3. The three nodes of the triangle that contains the dummy node, and
two shape functions evaluated at the dummy node for the interpolation step, are
kept in memory for each of the two dummy nodes that belong to each side. This
procedure represents a memory overhead of ten times the total number of sides
in a 2-D computation, but such a reconstruction approach is very robust for high
speed flow simulation and is recommended in such flow regimes. The alternating
digital tree [12] algorithm is adopted for the searching operations required. Other
reconstruction techniques which incur little or no memory overhead [2, 6]
can be employed.

Remark  4.  The  colouring  algorithm  [12]  is  used  to  prevent  recurrence  inside  the 
loops  used  in  the  flow  solution  algorithm,  therefore  allowing  vector  and  parallel 
processing  of  these  loops. 
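Steps 1, 2 and 5 of the pre-processor, together with Remarks 1, 2 and 4, can be sketched as follows. This is a simplified illustration, not the pre-processor of the paper: a Python dictionary plays the role of the hash table, the superedge grouping and the colouring are plain greedy passes (the actual algorithms of [11] and [12] may differ), and all function names are hypothetical.

```python
def extract_edges(triangles):
    """Unique edges of a triangulation, found with a hash table (a dict)."""
    edge_id = {}
    for tri in triangles:
        for k in range(3):
            a, b = tri[k], tri[(k + 1) % 3]
            key = (min(a, b), max(a, b))  # orientation-independent key
            edge_id.setdefault(key, len(edge_id))
    return list(edge_id)  # dicts preserve insertion order


def group_superedges(triangles):
    """Greedily turn triangles whose three sides are still unused into superedges."""
    used = set()
    superedges = []
    for tri in triangles:
        sides = [(min(tri[k], tri[(k + 1) % 3]), max(tri[k], tri[(k + 1) % 3]))
                 for k in range(3)]
        if not any(side in used for side in sides):
            used.update(sides)
            superedges.append(tuple(tri))
    remaining = [e for e in extract_edges(triangles) if e not in used]
    return superedges, remaining


def colour_items(items):
    """Greedy colouring: no node appears twice among the items of one colour."""
    colours = []  # list of (nodes already used, items in this colour)
    for item in items:
        for nodes, members in colours:
            if nodes.isdisjoint(item):
                nodes.update(item)
                members.append(item)
                break
        else:
            colours.append((set(item), [item]))
    return [members for _, members in colours]


# A square split into four triangles around the centre node 4.
tris = [(0, 1, 4), (1, 2, 4), (2, 3, 4), (3, 0, 4)]
edges = extract_edges(tris)
superedges, remaining = group_superedges(tris)
print(len(edges), superedges, remaining)
print(colour_items(edges))
```

In this small mesh only two of the four triangles can become superedges, because neighbouring triangles share a side; the two sides left over are the ones that would be handled by the loop over remaining edges.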

3.4  Parallel/ Vector  Implementation 

The operations performed inside the loops over the edges and boundary faces,
which take place in the edge-based flow solver algorithm, are: gather information
from the nodes of each edge; operate on this information; scatter the results back
to the nodes of the edges and add them to the nodal quantities.

These typical loops are entirely vectorizable provided each group (colour)
of superedges and/or edges, or boundary faces, is executed separately and an
appropriate compiler directive (e.g. !DIR$ IVDEP, for CRAY supercomputers),
instructing the compiler to ignore apparent vector dependencies, is inserted
before the vectorizable inner loop. Essentially,
these loops would be written like the one shown in Fig. 1. Clearly, some details
have been omitted for conciseness.

*** Loop over all colours
      do 20 iblock = 1, nblock
***     Compute first and last edges in this colour
***     Loop over all edges in this colour
!DIR$ IVDEP
        do 10 is = isfirst, islast
          inode1 = iside(is,1)
          inode2 = iside(is,2)
***       Compute contributions a1 and a2 from side is
          avar(inode1) = avar(inode1) + a1
          avar(inode2) = avar(inode2) - a2
   10   continue
   20 continue


Fig. 1. Loop over edges


where nblock is the number of colours into which the mesh was divided,
isfirst and islast are the first and last edges for each colour, iside is the
array with the nodal connectivity of the edges, avar is the variable to be updated
with this loop and a1 and a2 are the local contributions to avar of the nodes
of edge is. The computation of the local contributions, not shown in Fig. 1, can
demand a large number of operations, up to more than one hundred Fortran
statements in some routines used in the second-order scheme.
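The gather, operate and scatter-add pattern of the coloured loop can be mimicked in a few lines of Python. The sketch is only meant to show why the colouring removes the recurrence: inside one colour every node index occurs at most once, so the updates of that colour are independent. The colours, the nodal values `x` and the `contribution` function are hypothetical.

```python
def accumulate_by_colour(colours, contribution, n_nodes):
    """Gather, operate and scatter-add edge contributions one colour at a time."""
    avar = [0.0] * n_nodes
    for group in colours:
        # Gather + operate: no node is repeated inside a colour, so these
        # contributions are independent and could be computed in any order.
        updates = [(n1, n2, contribution(n1, n2)) for n1, n2 in group]
        # Scatter-add: each node index occurs at most once in `updates`.
        for n1, n2, (a1, a2) in updates:
            avar[n1] += a1
            avar[n2] -= a2
    return avar


x = [0.0, 1.0, 4.0, 9.0]  # hypothetical nodal values


def contribution(n1, n2):
    """Hypothetical edge contribution: the difference of the nodal values."""
    d = x[n2] - x[n1]
    return d, d


# Two colours for the four edges of a closed loop of nodes 0-1-2-3.
colours = [[(0, 1), (2, 3)], [(1, 2), (3, 0)]]
result = accumulate_by_colour(colours, contribution, 4)
print(result, sum(result))
```

If two edges of the same group shared a node, the two scatter-add operations on that node would have to be serialized, which is exactly the dependency that the colouring rules out.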

To  implement  superedge  loops,  all  loops  such  as  the  one  shown  in  Fig.  1 
were  modified  to  first  loop  over  all  triangular  superedges,  and  then  loop  over  the 
remaining  edges  of  the  mesh.  The  results  of  these  modifications  are  loops  such 
as  the  one  shown  in  Fig.  2. 

It is important to observe that the calculations for the local contributions of
the nodes, also not shown in Fig. 2, are identical to those in the loop over the
edges. So these changes could be made somewhat mechanically, using the "cut and
paste" facility of a modern text editor. This is not recommended, however, since
this procedure is very labour-intensive and error-prone, and would result in extremely
long procedures that seriously compromise the readability and the longer-term
maintenance of the computer program.

To avoid this problem, we collected all repeated computations into separate
subroutines. This is a classical technique to improve code modularity and
reusability, which has not been widely used in vector processing because compilers
normally cannot vectorize loops with calls to arbitrary subroutines. To overcome
this serious drawback, we used the inline facility (present in most modern
compilers, even for sequential machines), which expands the source code of the called
routine into the body of the calling program before compilation. The compiler
therefore sees no function call and can vectorize and parallelize the loops as
usual. This procedure allows easy implementation of different alternative data
structures with very little effort for coding changes, once the pre-processed data
is available.

The inner loops were distributed to multiple processors using the autotasking
facilities available on CRAY machines. In the cases where loops were not
automatically tasked, even when the update of the variables in the loop was
independent because of the colouring, we inserted a compiler directive (!MIC$
DO ALL AUTOSCOPE VECTOR), similar to that adopted for vectorization,
to force the tasking of the loop.

The parallel regions in the code should be as long as possible, so that parallel
start-up costs are reduced. Apart from the size of the problem being analysed,
a proper balance of the number of components (edges, superedges or boundary
faces) in each colour is fundamental for an effective parallel speed-up. Longer
parallel regions can be obtained if a domain decomposition is applied and parallel
directives are inserted at a higher level in the program. This has not been
implemented yet because it would require a more complex rewriting of the code,
since we would need colourings on two levels, special treatment for the borders
between domains and some other issues.
*** Loop over all colours
      do 20 iblock = 1, nblock
***     Compute first and last superedges in this colour
***     Loop over all superedges in this colour
!DIR$ IVDEP
!MIC$ DO ALL AUTOSCOPE VECTOR
        do 10 is = isfirst, islast, 3
          inode1 = iside(is,1)
          inode2 = iside(is+1,1)
          inode3 = iside(is+2,1)
***       Compute contribution a1 from edge is
***       Compute contribution a2 from edge is+1
***       Compute contribution a3 from edge is+2
          avar(inode1) = avar(inode1) + a1 - a3
          avar(inode2) = avar(inode2) + a2 - a1
          avar(inode3) = avar(inode3) + a3 - a2
   10   continue
   20 continue
*** Loop over all colours of remaining edges
      do 40 iblock = 1, nblockr
***     Compute first and last edges in this colour
***     Loop over all remaining edges in this colour
!DIR$ IVDEP
!MIC$ DO ALL AUTOSCOPE VECTOR
        do 30 is = isfirst, islast
          inode1 = iside(is,1)
          inode2 = iside(is,2)
***       Compute contributions a1 and a2 from side is
          avar(inode1) = avar(inode1) + a1
          avar(inode2) = avar(inode2) - a2
   30   continue
   40 continue

Fig. 2. Loop over superedges
Fig. 3. Steady flow past a cylinder at a
Mach  number  contours. 


4  Numerical  Results  and  Conclusions 

4.1  Numerical  Application 


This problem consists of a steady flow past a circular cylinder, at a free stream
Mach number of 3. The presence of sonic, stagnation and rarefaction zones makes
this problem challenging in terms of stability behavior. The final mesh, following
one adaptation, is shown in Fig. 3 together with the corresponding Mach number
contours. This mesh consists of 24,979 elements and 12,651 nodes. Note
that both the bow shock and the quasi-rarefaction zone behind the cylinder are
well represented, with the recirculation and the weak shocks captured. A detailed
discussion of the numerical predictions obtained with different flow solvers can be
found in [2] and is not of interest here.
4.2 Performance Studies

The CPU time for preprocessing the data, i.e., to form the edge or superedge data
structure from the conventional element-based data structure and to do several
pre-computations, is less than 0.5% of the total CPU time for the analysis and
so it is negligible. The small time spent in the preprocessing step reflects the
fact that it is done only once per analysis and also the use of data structures and
algorithms which enable efficient sorting and searching operations (hash tables,
alternating digital trees, etc.).

The mesh shown in Fig. 3 has 37,630 edges. For the analysis using the
superedge program the pre-processor was able to transform 86.75% of the edges
into superedges (10,881 triangle superedges), while 13.25% (4,987) remained as
single edges.
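These figures are mutually consistent, as the short check below shows. It only assumes that each triangle superedge holds exactly three edges.

```python
n_edges = 37630
n_superedges = 10881

edges_in_superedges = 3 * n_superedges       # three edges per triangle superedge
remaining = n_edges - edges_in_superedges    # edges left as single edges

pct_grouped = round(100.0 * edges_in_superedges / n_edges, 2)
pct_single = round(100.0 * remaining / n_edges, 2)
print(edges_in_superedges, remaining, pct_grouped, pct_single)
```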

The  performance  of  this  test  case  was  measured  on  three  different  computers, 
a  486  PC,  a  SUN  UltraSparc  I  workstation  and  a  CRAY  J90  vector/parallel 
computer. The processing (CPU) times spent to simulate 5000 time steps are
shown in Table 1. The analysis includes the program using either an edge or
superedge data structure, considering both first-order and higher-order schemes.


Table 1. CPU times in seconds

              Edges                    Superedges
Computer      1st order   2nd order    1st order   2nd order
PC            34144       66837        32784       64190
SUN           4449        5272         3356        5167
Cray          2170        3543         2152        3496

By analysing Table 1, we see that we can achieve only a minor advantage
when using the superedge instead of the edge data structure for this application. The
time saving for the PC is just around 4%, for the SUN workstation it is from
2% to 4% and for the single processor CRAY supercomputer it is approximately
1%.

The ratio between the number of floating point and indirect addressing operations
is already large enough for upwind-based schemes, as expected. For
instance, for the subroutine which computes Roe's approximate Riemann solver
we have a ratio of approximately 6 and 8 floating point operations per indirect
addressing operation for the edge and superedge data structures, respectively.

We have analysed the first-order scheme for 1000 iterations on a single processor
of the CRAY J90 with vectorization disabled. In this case we measured 2124
seconds and 1891 seconds for the CPU time with the edge and superedge data
structures, respectively. The difference was surprisingly much bigger than previously,
being about 12% less CPU time with superedges. This is probably due to a
bigger CPU overhead on indirect addressing when scalar processing is used, but
CRAY's vector computers are known not to perform well without vectorization.




Regarding the vector performance, the Mflop/s (millions of floating point
operations per second) rates were measured on single CPU computations employing
CRAY's perfview tool. The sustained job performance on the CRAY J90 was
around 57 Mflop/s for higher-order analysis and around 50 Mflop/s for first-order
analysis. The use of either edge or superedge data structures had no significant
impact on the Mflop/s rates. The code has sequential regions which explain the
above values, somewhat lower than expected. The routines which have more
intensive computations and which are highly vectorized reach up to 83 Mflop/s.

Parallel and theoretical speed-ups were estimated with CRAY's atexpert tool
on four processors, using CRAY's F90 compiler. The actual and theoretical
speed-ups are presented in Table 2. The theoretical speed-up in general is in good
agreement with the speed-up obtained on a dedicated run.


Table 2. Parallel performance on Cray J90

                        Edges                    Superedges
                        1st order   2nd order    1st order   2nd order
Achieved Speed-up       1.9         2.3          1.44        1.5
Theoretical Speed-up    2.0         2.5          2.4         3.1
Serial Code Portion     33.9%       20.7%        21.6%       9.4%

It was observed that more than 99% of the serial portion of the code refers
to the I/O routines. Such routines are called at each iteration or periodically for
printing and flushing the residuals for the convergence history and the solution for
possible re-starts of the analysis, respectively. Those subroutines could be easily
changed, reducing substantially the serial portion of the code and improving
vector and parallel performance. This was not done, however, to keep the code
robust for new challenging applications.
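The link between the serial code portion and the theoretical speed-up can be checked with Amdahl's law. The sketch below assumes that the theoretical values of Table 2 follow this law for four processors; the text does not state how atexpert derives them, but the law reproduces the tabulated figures to the precision given.

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Upper bound on the speed-up when a fraction of the run time stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


# Serial code portions from Table 2.
serial = {
    "edges, 1st order": 0.339,
    "edges, 2nd order": 0.207,
    "superedges, 1st order": 0.216,
    "superedges, 2nd order": 0.094,
}
for case, fraction in serial.items():
    print(f"{case:24s} {amdahl_speedup(fraction, 4):.1f}")
```

Under the same assumption, removing the I/O from the timed region would raise the bound towards the number of processors, which is the improvement referred to above.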

The importance of load balancing in the efficiency of the program can be easily
verified in the routine which solves the Riemann problem, using superedges.
It has two main loops: one over the superedges and another one over the remaining
edges, as illustrated in Fig. 2. The second loop, over the edges, is exactly
the same as the loop used to solve the Riemann problem in the edge-based program.
This loop parallelizes very well, and in the edge-based program it reaches
a speed-up of about 3.9. In the superedge program, however, the speed-up of the
corresponding loop was only 1.9.

With superedges, the 4987 remaining edges were blocked into six unbalanced
colours (2191, 1702, 678, 364, 43 and 9 edges each), an artifact of the colouring
algorithm used, while with the edge program, which used a different colouring
algorithm, the whole 37630 edges were blocked into ten colours with a balanced
3763 edges each. Increasing the number of processors cannot be efficient if
the extra processors are assigned to loops so short that the overhead for setting
up the parallel loops is significant. This shows the importance of balancing the
number of components for each colour and also that only for complex problems
which demand very large meshes can we expect good scalability and efficiency.
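
The effect of short, unbalanced colours can be illustrated with a simple cost model. The sketch below is not taken from the code discussed here: it assumes that every colour is executed as one parallel loop, that its edges are split evenly among the processors, and that a fixed start-up cost is paid once per colour. The function name `coloured_loop_speedup` and the parameter `setup` (start-up cost expressed in units of one edge update) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Modelled speed-up of a coloured edge loop on `nproc` processors.
 * Assumptions (illustrative only, not measurements):
 *   - each colour is one parallel loop,
 *   - its edges are split evenly among the processors,
 *   - `setup` is a fixed start-up cost per colour, in units of one
 *     edge update. */
double coloured_loop_speedup(const int *colour_size, size_t ncolours,
                             int nproc, double setup)
{
    double serial = 0.0, parallel = 0.0;
    assert(nproc > 0 && setup >= 0.0);
    for (size_t c = 0; c < ncolours; c++) {
        int n = colour_size[c];
        assert(n >= 0);
        serial += n;
        /* the busiest processor handles ceil(n / nproc) edges */
        parallel += (n + nproc - 1) / nproc + setup;
    }
    return parallel > 0.0 ? serial / parallel : 1.0;
}
```

With `setup` set to zero the model gives nearly the same speed-up for both colourings on four processors; as `setup` grows, the six short colours lose efficiency much faster than the ten balanced ones, which is the qualitative behaviour reported above.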

4.3  Conclusions 

In this paper we have addressed several important issues for simple and effective
implementation of numerical schemes on current shared memory supercomputers.
Superedges, and other alternative data structures, are interesting when the
indirect addressing operations account for an important portion of the total
computational cost.

The use of superedges when solving the Euler equations using explicit
high-resolution upwind-based schemes, however, does not pay off. On a
three-dimensional generalization of the current formulation we might expect a
slightly larger difference in the run times between the edge and superedge data
structures. Finally, when an implicit implementation of the flow solver algorithm
is devised, using a nonsymmetric solver such as GMRES, the matrix-vector
multiplication loop will present a smaller ratio between floating point and
indirect addressing operations, and reducing i/a through the use of superedges
might be worthwhile and could be attempted.

Abstract. Thesee is a 3D panel method code which calculates the characteristics
of a wing in an inviscid, incompressible, irrotational, and steady
airflow, in order to design new paragliders and sails.

In this paper, we present the parallelization of Thesee for low cost
workstation/PC clusters. Thesee has been parallelized using the ScaLAPACK
library routines in a systematic manner that led to a low cost development.
The code, written in C, is thus very portable since it uses only high
level libraries. This design was very efficient in terms of manpower and
gave good performance results. The code performance was measured
on 3 clusters of computers connected by different LANs: an Ethernet
LAN of SUN SPARCstations, an ATM LAN of SUN SPARCstations and
a Myrinet LAN of PCs. The last one was the least expensive and gave the
best timing results and a super-linear speedup.


1  Introduction 

The  aim  of  this  work  is  to  compare  the  performance  of  various  parallel  platforms 
on  a  public  domain  aeronautical  engineering  simulation  software  similar  to  those 
routinely  used  in  the  aeronautical  industry  where  the  same  numerical  solver  is 
used,  with  a  less  user-friendly  interface,  which  results  in  a  more  portable  code 
(smaller  size,  no  graphic  library). 

Parallel Thesee is written in C with the ScaLAPACK [3] library routines and
can be run on top of MPI [14] or PVM [6]; thus the application is portable on a
wide range of distributed memory platforms.

We first introduce the libraries that are necessary to understand
the ScaLAPACK package. Then we give an insight into the parallelization of the
code. Tests are presented before the conclusion.

*  This work was supported by EUREKA contract EUROTOPS, LHPC (Matra
MSI, CNRS, ENS-Lyon, INRIA, Region Rhone-Alpes), INRIA Rhone-Alpes project
REMAP, CNRS PICS program, CEE KIT contract




2  Software  libraries 

2.1  LAPACK,  BLAS  and  BLACS  libraries 

The BLAS (Basic Linear Algebra Subprograms) are high quality "building block"
routines for performing basic vector and matrix operations. Level 1 BLAS do
vector-vector operations, Level 2 BLAS do matrix-vector operations, and Level 3
BLAS do matrix-matrix operations.

LAPACK [1] provides routines for solving systems of simultaneous linear
equations, least-squares solutions of linear systems of equations, eigenvalue
problems, and singular value problems. The associated matrix factorizations (LU,
Cholesky, QR, SVD, Schur, generalized Schur) are also provided. The LAPACK
implementation uses the BLAS building blocks as much as possible to ensure
efficiency, reliability and portability.

The BLACS (Basic Linear Algebra Communication Subprograms) are dedicated
to communication operations used for the parallelization of the Level 3
BLAS or the ScaLAPACK libraries. They can of course be used for other
applications that need matrix communication inside a network. The BLACS are not
a general-purpose library for every parallel application but an efficient library
for matrix computation.

In the BLACS, processes are grouped in one- or two-dimensional grids. The BLACS
provide point-to-point synchronous receive, broadcast and combine operations.
There are also routines to build, modify or query a grid. Processes can be
enclosed in multiple overlapping or disjoint grids, each one identified by a
context. Different releases of the BLACS are available on top of PVM, MPI and
others. In this project, the BLACS are used on top of PVM or MPI.

2.2  The  ScaLAPACK  library 

ScaLAPACK is a library of high-performance linear algebra routines for
distributed memory message passing MIMD computers and networks of workstations
supporting PVM and/or MPI. It is a continuation of the LAPACK project,
which designed and produced analogous software for workstations, vector
supercomputers, and shared-memory parallel computers. Both libraries contain
routines for solving systems of linear equations, least squares problems, and
eigenvalue problems. The goals of both projects are efficiency (to run as fast
as possible), scalability (as the problem size and number of processors grow),
reliability (including error bounds), portability (across all important parallel
machines), flexibility (so users can construct new routines from well-designed
parts), and ease of use (by making the interface to LAPACK and ScaLAPACK
look as similar as possible). Many of these goals, particularly portability, are
aided by developing and promoting standards, especially for low-level
communication and computation routines. LAPACK will run on any machine where
the BLAS are available, and ScaLAPACK will run on any machine where both
the BLAS and the BLACS are available.

3  Implementation


The Thesee sequential code [8] uses the 3D panel method (also called the
singularity element method), which is a fast method to calculate a 3D low speed
airflow [9]. It can be separated into 3 parts:


Part P1  The fill-in of the element influence matrix from the 3D mesh. Its
complexity is O(n^2), with n the mesh size (number of nodes). Each matrix
element gives the contribution of a double layer (source + vortex) singularity
distribution on facet i at the center of facet j.

Part P2  The LU decomposition of the element influence matrix and the
resolution of the associated linear system (O(n^3)), in order to calculate the
strength of each element singularity distribution.

Part P3  The speed field computation. Its complexity is O(n^2), because the
contribution of every node has to be taken into account for the speed
calculation at each node. Pressure is then obtained using the Bernoulli equation.


Each of these parts is parallelized independently, and the parts are linked
together by the redistribution of the matrix data. For each part, the data
distribution is chosen to ensure the best possible efficiency of the parallel
computation.

The rest of the computation is the acquisition of the initial data and the
presentation of the results. Software tools are used to build the initial wing
shape and to modify it as the results of the simulation give valuable insight.
The results are presented using a classical viewer showing the pressure field with
different colors and with a display of the raw data in a window.


3.1  Fill  in  of  the  influence  matrix 


The 3D meshing of the wing is defined by an airfoils file, i.e. by points around
the section of the wing that is parallel to the airflow, in order to define the shape
of the wing in this section, as in the NACA tables (1). Thus two airfoils delimit one
strip of the wing, i.e. the narrow surface between the airfoils. The wing is then
divided into strips and each strip is divided into facets by perpendicular divisions
joining two similar points of the airfoils. This meshing is presented in Figure 1.
Notice that the points are not regularly distributed in order to have more facets,
and thus more precision in the computation, in areas where the pressure gradient is
greater.


(1)  http://www.larc.nasa.gov/naca/

Fig. 1. Strip of the meshing


During the computation, for each facet the whole wing must be examined
(every other facet). Since the wing is symmetrical, the computation is made with
a half-wing; as the system matrix is n x n, its size is thus divided by four.
The results for the whole wing are then deduced from those of the half-wing.
Let K be the number of strips, and N be the number of facets per strip. The
sequential code for the fill-in of the influence matrix is made of nested loops:


/* computation of the influence coefficient (source) */
for (i1=1; i1<=K/2; i1++) {                /* for each strip */
  cste1=(i1-1)*(N+2)+1;
  for (j1=cste1; j1<=(cste1+N-1); j1++) {
    /* 2 loops to examine each point of the half wing */
    for (i2=1; i2<=K/2; i2++) {
      cste2=(i2-1)*(N+2)+1;
      for (j2=cste2; j2<=(cste2+N); j2++) {
        M[ind++]=...;
      }
    }
  }
}


Fig.  2.  Sequential  code  for  the  influence  matrix  fill  in. 


The computation of the influence coefficient M[i] for one facet is independent
of the other coefficients. The external loop can be split up straightforwardly in
order to parallelize the computation on different processors. Each processor will
then compute the facets of a given number of strips. The strips are assigned to
a processor according to their number. For instance, with 4 processors and 20
strips to compute, each processor will compute five strips: processor no. 1
computes strips 1 to 5, processor no. 2 strips 6 to 10, etc.

Fig. 3. Simple assignment of the strips to the processors.


Hence the external loop becomes:

for (i1=ceil((MyPRow)*(K/2)/((NPRow)))+1;
     i1<=ceil((MyPRow+1)*(K/2)/(NPRow));
     i1++)


Fig.  4.  The  parallel  version  of  the  external  loop  corresponding  to  Figure  3  assignment. 
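
The loop bounds of Fig. 4 can be packaged as a small helper. The sketch below is an illustration under stated assumptions, not the Thesee source: the names `ceil_div` and `strip_range` are hypothetical, and the ceiling is computed in integer arithmetic. If MyPRow, K and NPRow are declared as integers (the declarations are not shown), the expression passed to ceil() in Fig. 4 is already truncated by the integer division, so ceil() has no effect there; the strips are still partitioned without gap or overlap, but the split differs from a true ceiling whenever K/2 is not a multiple of NPRow.

```c
#include <assert.h>

/* Ceiling of a / b for a >= 0 and b > 0, using integers only. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* First and last strip (1-based, inclusive) owned by process row
 * `my_row` out of `np_row`, for `nstrips` strips in total
 * (K/2 in the notation of the text). */
void strip_range(int my_row, int np_row, int nstrips, int *first, int *last)
{
    assert(np_row > 0 && my_row >= 0 && my_row < np_row && nstrips >= 0);
    *first = ceil_div(my_row * nstrips, np_row) + 1;
    *last  = ceil_div((my_row + 1) * nstrips, np_row);
}
```

For the example of the text (4 processors, 20 strips) this gives strips 1 to 5 on the first process row, 6 to 10 on the second, and so on.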


The data needed for these computations (the initial meshing of the wing) are
distributed to each processor using the PDGEMR2D routine from the ScaLAPACK
parallel library [3]; the results are then gathered with the same routine.
This routine provides the distribution of data between virtual grids of processors
with any kind of block cyclic data distribution. We used it from the initial grid
of size 1 x 1, which contains the matrix to compute, to the computation virtual
grid of processors with 2 to 4 processors arranged in a 1 x 2 or 1 x 4 shape with,
for instance, a full block data distribution. Notice that this data distribution
introduces no other communication cost in this part because it is embarrassingly
parallel.

The second inner step of this part, the computation of the doublet influence
coefficients, is carried out with the same method, splitting the outer loop and
using the PDGEMR2D routine for the data distribution.


3.2  Influence  matrix  resolution 

This part mainly consists in the resolution of a linear system (Part P2).
In the sequential version of Thesee it is carried out with a simple call of the
DGESV routine from LAPACK [1]. DGESV solves a linear equation system
with an LU factorization followed by forward and back substitution.

DGESV_(&n, &nrhs, &M[1], &lda, &ipiv[1], &B[1], &ldb, &info);

In  this  procedure  DGESV  solves  the  M  *X  =  B  system  of  equations  and  stores 
the  results  in  the  B  vector. 
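
To make the "solve in place" convention concrete, here is a minimal hand-written stand-in. It is not LAPACK code and not the routine used in Thesee: it is a plain Gaussian elimination with partial pivoting for one right-hand side, with row-major storage, and the name `lu_solve` and its return convention are assumptions made for this sketch. Like DGESV, it overwrites the matrix with its LU factors and the right-hand side with the solution.

```c
#include <assert.h>

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Solve A*x = b for a dense n x n matrix stored row by row in a[].
 * A is overwritten by its LU factors and b by the solution.
 * Returns 0 on success, or k+1 if the pivot of column k is exactly 0.
 * Illustrative stand-in only; this is not the LAPACK implementation. */
int lu_solve(int n, double *a, double *b)
{
    assert(n > 0 && a && b);
    for (int k = 0; k < n; k++) {
        /* partial pivoting: largest entry of column k, rows k..n-1 */
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (abs_d(a[i * n + k]) > abs_d(a[p * n + k])) p = i;
        if (a[p * n + k] == 0.0) return k + 1;
        if (p != k) {
            for (int j = 0; j < n; j++) {
                double tmp = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = tmp;
            }
            double tb = b[k]; b[k] = b[p]; b[p] = tb;
        }
        /* eliminate column k below the diagonal, updating b as we go */
        for (int i = k + 1; i < n; i++) {
            double l = a[i * n + k] / a[k * n + k];
            a[i * n + k] = l;
            for (int j = k + 1; j < n; j++) a[i * n + j] -= l * a[k * n + j];
            b[i] -= l * b[k];
        }
    }
    /* back substitution with the upper triangular factor */
    for (int i = n - 1; i >= 0; i--) {
        for (int j = i + 1; j < n; j++) b[i] -= a[i * n + j] * b[j];
        b[i] /= a[i * n + i];
    }
    return 0;
}
```

In a production code the library routine remains preferable, since it is blocked and built on optimized BLAS kernels.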

The parallel solve routine PDGESV from ScaLAPACK provides the same system
resolution with a parallel LU decomposition on block cyclically distributed
data on a virtual grid of processors. A redistribution of the data on each
processor is then needed to use the parallelized resolution.

The main idea is that each processor involved in the resolution holds a
sub-matrix of the matrix M to solve. The processors of a parallel machine with
P processors are presented to the user as a linear array of process IDs, labeled
0 through (P-1). When doing matrix computations, it is often more convenient
to map this 1-D array of P processes into a logical two-dimensional process
mesh, or grid. This grid has R processor rows and C processor columns, where
R * C = G <= P. A processor can now be referenced by its coordinates within
the grid (indicated by the notation (i, j), where 0 <= i < R and 0 <= j < C).
An example of such a mapping is shown in Figure 5.


[Figure: process grid Proc (0,0), Proc (0,1), Proc (1,0), Proc (1,1); arrays M and B]


Fig.  5.  Example  of  redistribution  with  a  2  x  2  grid 
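
The mapping from the linear array of process IDs to grid coordinates can be sketched as follows. The row-major numbering used here is an assumption (IDs fill the first grid row, then the second, and so on); column-major numbering is the other common choice. The function names `id_to_grid` and `grid_to_id` are hypothetical and are not BLACS calls.

```c
#include <assert.h>

/* Map a linear process ID (0 .. R*C-1) to grid coordinates (i, j),
 * assuming row-major numbering of the R x C grid. */
void id_to_grid(int id, int nrows, int ncols, int *i, int *j)
{
    assert(nrows > 0 && ncols > 0 && id >= 0 && id < nrows * ncols);
    *i = id / ncols;
    *j = id % ncols;
}

/* Inverse mapping: grid coordinates (i, j) back to the linear ID. */
int grid_to_id(int i, int j, int nrows, int ncols)
{
    assert(i >= 0 && i < nrows && j >= 0 && j < ncols);
    return i * ncols + j;
}
```

On a 2 x 2 grid, IDs 0 to 3 map to (0,0), (0,1), (1,0) and (1,1) under this convention.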


A processor can be a member of several overlapping or disjoint virtual grids
during the computation, each one identified by a context.

The ScaLAPACK library uses a block cyclic data distribution on a virtual
grid of processors in order to reach a good load balance, good computation
efficiency on arrays, and an equal memory usage between processors. The load
balance is ensured by the cyclic distribution, which gives each processor matrix
elements coming from "different" locations of the matrix (compared to
a classical full block decomposition). The communication efficiency is obtained
because, in a cyclic distribution, the row and column shape of the matrix is
preserved, so most communications of 1D arrays can happen without a complex
index computation (see [4, 5, 13, 2] among others). We ran tests in order to choose
the best grid shape and the best block size of the data distribution for our
problem on each of the platforms.

Matrices and arrays are then wrapped by blocks in all dimensions corresponding
to the processor grid using the PDGEMR2D routine. Figure 6 illustrates
the organization of the block cyclic distribution of a 2D array on a 2D grid of
processors.
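
Along one dimension, the block cyclic rule reduces to two small index formulas. The sketch below uses 0-based indices and assumes that the first block is stored on process 0; the function names `bc_owner` and `bc_local_index` are hypothetical. For a 2D array the same rule is applied independently to the row index (with the number of process rows) and to the column index (with the number of process columns).

```c
#include <assert.h>

/* One dimension of a block cyclic distribution (0-based indices):
 * blocks of `nb` consecutive entries are dealt out to `nproc`
 * processes in round-robin order, starting with process 0. */

/* Process that owns global index g. */
int bc_owner(int g, int nb, int nproc)
{
    assert(g >= 0 && nb > 0 && nproc > 0);
    return (g / nb) % nproc;
}

/* Position of global index g inside the owner's local storage. */
int bc_local_index(int g, int nb, int nproc)
{
    assert(g >= 0 && nb > 0 && nproc > 0);
    return (g / (nb * nproc)) * nb + g % nb;
}
```

Each process thus receives blocks taken from all over the array, which is what spreads the shrinking active part of an LU factorization evenly over the processes.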

Fig. 6. The block cyclic data distribution of a 2D array on a 2 x 3 grid of processors.


In the parallel version of Thesee, we used the following parameters for the data
distribution of the LU factorization: the block size is 32 x 32 and the processor
grid shape is a 1D grid, which gave the best overall computation timings (for
further information see [10]).


Fig.  7.  Data  redistribution  inside  parallel  Thesee 


In the first parts of parallel Thesee, the global system matrix (M) is held in a
1 x 1 grid by processor 0 (Context1x1). Then, it is distributed over a 1 x NbProc
grid (Context1xN) with the PDGEMR2D (2) routine. The same operation is also
carried out for the B vector. Then each processor calls the PDGESV routine from
ScaLAPACK instead of DGESV from LAPACK. After this, the solution vector is
distributed over the local sub-matrices of B. A new call to PDGEMR2D is needed to
gather the solution sub-vectors from the Context1xN grid to the Context1x1 grid.


3.3  Speed  computation 

The speed computation procedure (Part P3) is made of loops to compute the
speed array (S) and the potential array (Pot).

The sequential code uses two nested loops for the speed array computation.

(2)  Parallel Double GEneral Matrix Redistribution (from ScaLAPACK)

for (i2=1; i2<=K/2; i2++) {
  for (j2=1; j2<=N+1; j2++) {
    S[(i2-1)*(N+1)+j2]=...;
  }
}


Fig.  8.  Sequential  code  of  the  speed  computation 


The computation of an element of the speed array is independent from all
the others, giving us another nice embarrassingly parallel problem. The external
loop is thus split as in Part P1 and each processor is assigned a given number
of strips to compute. The computation code of the speed array is modified as
shown in Figure 9.


/* number of the first strip computed by the proc */
ideb=ceil((MyPRow)*(K/2)/((NPRow)))+1;

/* speed computation loop */
for (i2=ceil((MyPRow)*(K/2)/((NPRow)))+1;
     i2<=ceil((MyPRow+1)*(K/2)/(NPRow));
     i2++)
{
  for (j2=1; j2<=N+1; j2++)
  {
    S[(i2-ideb)*(N+1)+j2]=...;
  }
}


Fig. 9. Parallelization of the external loop of Figure 8
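
The only change to the array index between Fig. 8 and Fig. 9 is that the strip number is counted from the first strip owned by the process. The sketch below spells out both index formulas; the function names `speed_global_index` and `speed_local_index` are hypothetical, and `n` stands for N, the number of facets per strip.

```c
#include <assert.h>

/* Index into the global speed array (Fig. 8) for strip i2, point j2. */
int speed_global_index(int i2, int j2, int n)
{
    assert(i2 >= 1 && j2 >= 1 && j2 <= n + 1);
    return (i2 - 1) * (n + 1) + j2;
}

/* Index into the local speed array (Fig. 9) of the process whose
 * first strip is `ideb`. */
int speed_local_index(int i2, int j2, int n, int ideb)
{
    assert(ideb >= 1 && i2 >= ideb && j2 >= 1 && j2 <= n + 1);
    return (i2 - ideb) * (n + 1) + j2;
}
```

The two indices differ by the constant offset (ideb-1)*(N+1), so each process stores only its own strips contiguously, starting at the same position as the sequential code does for strip 1.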


The  computation  of  the  potential  array  is  done  with  the  same  method. 


4  Performance tests

The tests were run on 2 different platforms with 3 different networks: an
Ethernet and ATM network of SUN Sparc 5 85 MHz workstations with Solaris, and an
Ethernet and Myrinet network of Pentium-Pro 200 MHz PCs running Linux. The
lower-level computation library (BLAS) was an optimized version on the SUNs and
a compiled version on the Pentium-Pros.

The efficiency of the fill-in of the influence matrix and that of the computation
of the speed and potential arrays are roughly the same on every configuration.
The code for these parts is "embarrassingly parallel" and thus the speed-up is
almost equal to the number of processors, whereas the LU factorization, which
involves a lot of communication, is highly dependent on the network software and
hardware.
4.1  Optimal  block  size 

There are only slight differences in performance between the different block
sizes on such small platforms with our problem sizes. However, a block size of
32 x 32 was optimal on every configuration (i.e. Ethernet, ATM, Myrinet).

We present in Fig. 10 the timings for the LU resolution with a 1722 x 1722
system, which corresponds to our production problem size.


Block size   Timings
             IP/Ethernet (1)   IP/ATM (1)   IP-BIP/Myrinet (2)
8 x 8        71.3              67.7         30.1
16 x 16      68.9              61.3         29.5
32 x 32      68.3              61.1         28.1
64 x 64      74.4              66.8         34.1

Fig. 10. Timings of the LU resolution for different block and problem sizes.


The results show the advantage of the newer Pentium-Pro 200 generation over
the rather old Sparc 85 and give the best block size for each configuration.

4.2  Comparison  between  Ethernet  and  ATM 

We present here the timing results of the whole computation using the different
platforms. First, we compare the networks with the same processor kind (SUN
Sparc 5) over PVM.

System size           Timings
                      IP/Ethernet (s)   IP/ATM (s)
4 Proc 902 x 902      16.3              13.5
2 Proc 902 x 902      18.8              16.2
1 Proc 902 x 902      20.3 (3)          20.3 (3)
4 Proc 1722 x 1722    68.3              61.1
2 Proc 1722 x 1722    108.9             95.9
1 Proc 1722 x 1722    131.8 (3)         131.8 (3)

Fig. 11. Timings of the whole computation.


System size           Speedup
                      IP/Ethernet       IP/ATM
4 Proc 902 x 902      1.24              1.50
2 Proc 902 x 902      1.07              1.25
4 Proc 1722 x 1722    1.92              2.15
2 Proc 1722 x 1722    1.21              1.37

Fig. 12. Speedup of the whole computation.


The gain obtained with ATM is small because the startup time of this network
is similar to that of Ethernet, and startup time makes up most of the
communication delay. However, the speedup obtained is not negligible when the
code is used in production, where response time is critical.

(1)  on SUN Sparc
(2)  on Pentium Pro
(3)  sequential version


4.3  Comparison between Ethernet and Myrinet

We present here the timings obtained on the PC platform. The Myrinet network
is driven by the BIP [11, 12] (Basic Interface for Parallelism) software. BIP is a
new protocol that provides a small parallel API implemented on the Myrinet
network. Other protocol layers are implemented for the classical interfaces. BIP
delivers to the application the maximal performance achievable by the hardware
using a low-latency zero-copy mechanism. An IP-BIP stack has been built on
top of BIP. In addition, a port of MPI-CH was realized with MPI-BIP [15, 12, 7].
The results of parallel Thesee have been measured with PVM over IP/Ethernet,
PVM over IP-BIP/Myrinet and MPI over BIP/Myrinet. This gives an idea of
the portability of our code, which uses library calls that are available over IP,
PVM, MPI, ...


System size           Timings
                      IP/Ethernet (s)   IP-BIP/Myrinet (s)   MPI-BIP/Myrinet (s)
4 Proc 902 x 902      10.2              4.7                  3.1
2 Proc 902 x 902      9.6               6.7                  5.0
1 Proc 902 x 902      10.0 (3)          10.0 (3)             10.0 (3)
4 Proc 1722 x 1722    45.9              28.1                 21.3
2 Proc 1722 x 1722    56.4              44.3                 38.1
1 Proc 1722 x 1722    87.4 (3)          87.4 (3)             87.4 (3)

Fig.  13.  Timings  with  Myrinet. 

System size           Speedup
                      IP/Ethernet       IP-BIP/Myrinet       MPI-BIP/Myrinet
4 Proc 902 x 902      0.97              2.10                 3.24
2 Proc 902 x 902      1.03              1.49                 1.97
4 Proc 1722 x 1722    1.90              3.10                 4.09
2 Proc 1722 x 1722    1.54              1.97                 2.29

Fig.  14.  Speedup  with  Myrinet. 
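
The speedup figures are the ratio of the sequential time to the parallel time, and a run is super-linear when the efficiency (speedup divided by the number of processors) exceeds 1. The sketch below uses hypothetical function names `speedup` and `efficiency`; values recomputed from the rounded timings of Fig. 13 may differ from Fig. 14 in the last digit.

```c
#include <assert.h>

/* Speed-up of a parallel run relative to the sequential run. */
double speedup(double t_seq, double t_par)
{
    assert(t_seq > 0.0 && t_par > 0.0);
    return t_seq / t_par;
}

/* Parallel efficiency: speed-up per processor.
 * A value above 1 indicates a super-linear speed-up. */
double efficiency(double t_seq, double t_par, int nproc)
{
    assert(nproc > 0);
    return speedup(t_seq, t_par) / nproc;
}
```

For the 1722 x 1722 system on 4 processors, the MPI-BIP/Myrinet timings (87.4 s sequential, 21.3 s parallel) give an efficiency slightly above 1, while the IP-BIP/Myrinet and IP/Ethernet runs stay below 1.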


First, notice the advantage of the Pentium-Pro over the Sparc in the
sequential timings.

Not surprisingly, the best results are achieved with MPI-BIP, which provides
a very low latency of 9 µs for the basic send communication. The gain on the
large problem size and platform is more than 50% over the Ethernet run, leading
to a super-linear speed-up.

This can be explained by a better cache hit ratio in the parallel version of the
code. As the matrix is distributed cyclically over the processors, the computation
works on blocked data that fit better in the cache during the LU decomposition,
leading to a better use of the processor's pipeline units. Moreover,
communications are overlapped with computation in the parallel LU decomposition;
this overlap can be (relatively) increased because the total communication cost
is greatly reduced with the Myrinet+BIP platform.

These outstanding results show that low-level access to a high-speed network
is essential to achieve the best possible performance in parallel computation.

4.4  Industrial  use  of  the  code 

We ran the code on an industrial version of the PC-Myrinet cluster, the POPC
(Pile of PCs) machine designed by Matra. This architecture is slightly slower
than the original test-bed but gives interesting results about the scalability of
the code. When using the compiled BLAS kernels, depending on the data size
of the problem, there is little interest in going beyond 6 processors. When
using an optimized version of the BLAS designed for the Pentium processors, the
timings dropped by a factor of more than 2 for the small wing and more
than 3 for the big one, and there is little gain in going beyond 4 processors
with the small wing and beyond 8 processors with the big one. From a user's
point of view, the elapsed time was decreased by a factor of 25 (from more
than a minute to 5 seconds). This almost immediate answer will drastically
improve the iterative production process for new wing shapes, for the cost of
a few PCs, the use of a good BLAS kernel, and a very simple parallelization
method.


System size                   Number of processors
                              1        2       4       6       8       10
wing 902                      8.97     5.11    3.11    2.36    2.12    1.79
wing 902 (optimized BLAS)     4.199    2.71    1.88    1.59    1.39    1.31
wing 1722                     75.72    38.17   21.07   14.11   12.63
wing 1722 (optimized BLAS)    25.337   14.6    8.09    6.32    5.41    5.10

Fig. 15. Timings of the whole execution on the POPC (Pile of PCs) machine


5  Conclusion 

We described our work on the parallelization of a 3D airflow simulation that
uses the singularity method, which is well suited to low speed airflows.

We presented a very easy and clean way to parallelize such numerical code,
using only parallel library routines (this requires the sequential code to be
written with sequential library routines too), loop splitting and calls to a data
redistribution routine. The parallel code is thus portable (we ran it over IP,
PVM and MPI) and efficient (super-linear speedup over Myrinet).

We demonstrated that low cost parallel hardware and good software can lead
to significant improvements for production codes, starting from a delay of more
than 2 minutes and going down to 21 s on a four-PC platform with Myrinet.

Our future work will consist in automating this parallelization
process for numerical codes that use library routines, by means of a software tool.

Abstract. A 2D Fourier-Chebyshev pseudo-spectral method for the full
Navier-Stokes equations with a dynamical domain decomposition technique
has been parallelized on a SPMD machine, a CRAY-T3E. The
parallelism is grounded on the distribution of the data among processors
and on the simultaneous computations on each subdomain. The SHMEM
paradigm is used for communication between processors. Comparisons
between the vector and the parallel versions of spectral derivatives
with Fourier and Chebyshev expansions are presented. Performances
versus the number of processors and collocation points are also studied.
Validation of the code has been performed by simulating a subsonic
Kelvin-Helmholtz instability and comparing with results obtained
on vector machines. Performances for two different configurations, the
Kelvin-Helmholtz and Rayleigh-Taylor flows, are also presented.


1  Numerical  Method 

In order to study transition to turbulence, very accurate numerical schemes
and large resolution are required [2]. With this aim in view, a sophisticated 2D
dynamical multidomain pseudo-spectral code has been developed to simulate
viscous compressible flows, and especially the Kelvin-Helmholtz and
Rayleigh-Taylor instabilities. The numerical method solves the full 2D
Navier-Stokes equations. It uses a Fourier-Chebyshev expansion. The features of
our method are as follows [1]:

1.  Time marching is done with a semi-implicit third order Runge-Kutta scheme
in a low-storage formulation. The advective terms are treated explicitly and
all diffusion terms are handled implicitly. Since transport coefficients are
constant, the implicit stage is performed in the Fourier space by means of
a Chebyshev iterative scheme. This procedure allows us to use larger time
steps.

2.  The vertical direction is decomposed into non-overlapping subdomains.
Density is matched with a simple upwind procedure. Velocities and temperature
are handled with the influence matrix method which reflects the continuity
of the function and its first normal derivative at the interface.

3.  In each subdomain, a self-adaptive coordinate transformation is used.
Since strong gradients may occur in the middle of the subdomain, a coordinate
transformation is used to bring the mesh points into the vicinity of the
gradient. These mappings and the location of the interfaces between subdomains
need to be self-adaptive because gradients move in time. Indeed, an
interesting feature of this numerical method is that it automatically optimizes
the locations of the interfaces by minimizing the H'~ norm.


2  The  Parallelization  Method 

At first, the numerical code was running on a vector computer, a CRAY-YMP.
Because the use of this code was limited by the memory size of the computer
(the resolution was also restricted) and by the time required for a whole
simulation, the only solution was to develop a parallel version for 2D and later
3D calculations. In our case, the parallelization began on a CRAY-T3D and was
finished on a CRAY-T3E. This type of machine offers two advantages: it allows
us to increase the total available memory and simultaneously to decrease
the cost in CPU time. For example, the maximum allocated memory for a user
is 168 * 16 Mwords on a CRAY-T3E with 168 processor elements, whereas only
512 Mwords are available on a CRAY-T90 (with 24 processors, and sometimes
24 users together!). This new version permits us to execute larger
calculations, that is, to increase precision and also resolution, i.e., the total
number of mesh points. The CRAY-T3E allows only SPMD programming (Single
Program Multiple Data), which means that a single program is forked on every
processor element (PE), which computes with its own data.


2.1  Domain  Decomposition 

The numerical method is particularly well suited to parallelism, because it
uses a domain decomposition method. Since the physical domain is divided into
a small number of subdomains (typically 3 to 9) in one direction (the vertical
z-direction), the parallelization procedure consists of distributing the
subdomains and their associated physical quantities (velocities, density,
temperature, pressure, energy ...) over groups of processors. Subdomains are
shared out among processors, so each processor is assigned to a fixed
subdomain.

We can represent each physical quantity in the form of a 2D matrix and
suppose that columns contain all x-data at a fixed z-value, and rows all
z-data at a fixed x-value. This representation leads us to distribute the
columns of a global matrix over a group of processors, so each processor
contains a small number of whole columns. In the code, a Fast Fourier
Transform (FFT) is performed in the x-direction (x-derivative). It can be
easily parallelized: each processor performs a few FFTs, while all processors
are running in parallel. Serial computations are thus transformed into
parallel computations. Another frequent operation is the z-derivative through
a matrix-matrix product. Because processors do not hold in their local memory
all the data needed for this product or for any other global operation, data
transfer is required, and it carries a high communication cost. The
matrix-matrix product is heavily used in the code kernel, which is why we
tried to minimize this cost, for instance by searching for the fastest way to
transpose a distributed matrix.
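
The data layout described above can be emulated on a single machine. The
following is a minimal sketch in Python with NumPy, not the authors' code:
the "processors" are simply list entries, the all-to-all exchange that a
distributed transposition would require is replaced by array slicing, and the
function names, the period `lx` and the derivative matrix `dz` are assumptions
made for illustration.

```python
import numpy as np

def x_derivative(block, lx):
    """Fourier x-derivative of every column of `block` (x runs along axis 0)."""
    nx = block.shape[0]
    ik = 2j * np.pi * np.fft.fftfreq(nx, d=lx / nx)  # i times the wavenumber
    return np.fft.ifft(ik[:, None] * np.fft.fft(block, axis=0), axis=0).real

def kernel_distributed(u, dz, lx, nproc):
    """x-derivative, transposition, z-derivative with columns spread over nproc PEs."""
    # Each PE owns a few whole columns (all x-data at fixed z): FFTs are local.
    col_blocks = np.array_split(u, nproc, axis=1)
    ux_blocks = [x_derivative(b, lx) for b in col_blocks]
    # Transposition step: PE number pe gathers its slice of rows from every PE,
    # so that it now owns whole rows (all z-data at fixed x).
    row_blocks = [
        np.hstack([np.array_split(b, nproc, axis=0)[pe] for b in ux_blocks])
        for pe in range(nproc)
    ]
    # z-derivative as a matrix-matrix product, again purely local.
    return np.vstack([rb @ dz.T for rb in row_blocks])

def kernel_serial(u, dz, lx):
    """Reference computation on the undistributed matrix."""
    return x_derivative(u, lx) @ dz.T
```

In this emulation the only step that would involve communication on a real
machine is the construction of `row_blocks`, which is why the cost of the
transposition dominates the discussion that follows.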


2.2  Strategy  for  the  Parallelization 

Choice between PVM and SHMEM. The numerical method requires data
transfer between processors, i.e., communication, for matrix operations, such
as matrix transposition or the search for a global matrix minimum or maximum,
and for matching physical quantities at the domain interfaces. To minimize the
overhead time due to communication, we looked into the use of two paradigms,
PVM and SHMEM. Two criteria were taken into account: portability and speed.

First, PVM (Parallel Virtual Machine), a message passing library, was used
because of its portability. After a series of tests that revealed a lack of
efficiency, we added SHMEM (Shared Memory Access Library) instructions in
order to decrease execution time. The SHMEM routines are data passing library
routines, which can be used as a replacement for message passing routines.
Bandwidth is higher with SHMEM and latency is lower. In other words, a larger
number of elements is transferred between processors within a shorter time.
To be precise, latency is only 1.85 µs on a CRAY-T3D and 1.75 µs on a
CRAY-T3E for shmem_put, against respectively 34 µs and 12 µs for PVM on the
CRAY-T3D and T3E. Bandwidth is about 120 MBytes/s for SHMEM without stream
buffers and only 26 MBytes/s for PVM. SHMEM thus transfers about 5 times more
elements per unit time, with a latency more than 6 times shorter. In
conclusion, these routines minimize the overhead associated with data passing
requests, maximize bandwidth and minimize data latency.
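
To see how these two figures combine, one can use the usual linear cost model
(start-up latency plus message size divided by bandwidth). The sketch below is
only an illustration: the model itself, the pairing of the T3E latencies with
the quoted bandwidths, and the convention 1 MByte = 10^6 bytes are assumptions,
not statements from the measurements above.

```python
def transfer_time_us(n_bytes, latency_us, bandwidth_mb_s):
    """Linear cost model: start-up latency plus payload divided by bandwidth."""
    # With 1 MByte = 10**6 bytes, 1 MByte/s equals 1 byte per microsecond.
    return latency_us + n_bytes / bandwidth_mb_s

# Figures quoted in the text (latencies are the CRAY-T3E values).
SHMEM = {"latency_us": 1.75, "bandwidth_mb_s": 120.0}
PVM = {"latency_us": 12.0, "bandwidth_mb_s": 26.0}

def speedup(n_bytes):
    """Ratio of the modelled PVM time to the modelled SHMEM time."""
    return transfer_time_us(n_bytes, **PVM) / transfer_time_us(n_bytes, **SHMEM)
```

Under this model the gain ranges from 12/1.75 (about 6.9) for very short
messages, where latency dominates, down to 120/26 (about 4.6) for very long
ones, where bandwidth dominates.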

We compared execution times for a special problem that is characteristic of
our resolution method: the matrix transposition. In the code kernel, an FFT is
executed first on each processor. Then the global matrix has to be transposed,
before the z-derivative is computed through a matrix-matrix product. This
matrix transposition requires data transfers. Because the vast majority of the
time is spent in this type of operation, total execution times for an FFT, a
matrix transposition and a matrix-matrix product have been compared with both
paradigms, PVM and SHMEM.

In Fig. 1, the comparison between the time ratios obtained in the code kernel
using PVM or SHMEM shows clearly that SHMEM is the fastest: by using
shmem_iput or shmem_ixput (variants of shmem_iput), the time on the CRAY-T3D
is at least 6 times shorter than with PVM.


Fig. 1. Ratios of CPU time YMP/T3D vs. the paradigm used, PVM or SHMEM.

Finally, we chose to use only SHMEM because of its efficiency. The next
version of MPI (Message Passing Interface), MPI-2, which is becoming a
standard, will include SHMEM-type instructions and make portability easier.
So, in the near future, both important criteria will be satisfied.


Difficulties due to the Implementation of SHMEM. SHMEM is a low-level
library. As a result, it is more efficient than PVM, but it is more difficult
to implement in a code. Parallelizing a code with SHMEM requires some
knowledge of hardware features. During the parallelization, the programmer
has to be very careful for the following reasons.

On the one hand, the data to be transferred usually have to be symmetric,
that is, statically allocated in memory; more precisely, an object has to be
associated with the same address on every PE. This is possible by using a
common storage area, or by using special directives like !CDIR$ SYMMETRIC.

A data object is called symmetric if its local and remote addresses have a
known relationship. This is one of the reasons why one has to be very careful
with the use of dynamically allocated data objects.

The shmem_put and shmem_get instructions copy data directly from memory to
memory: with shmem_put, data are copied from the local processor memory into
the remote processor memory, without notifying the remote processor. The
shmem_get routine copies data from the remote processor memory into the local
memory.

The shmem_put call presents some disadvantages: on the CRAY-T3D, cache
coherency is only ensured by flushing the cache, because the shmem_put routine
updates only the central memory and not the cache of the remote processor.
That is why flushing the cache is very important; otherwise, on the remote
processor, cache and memory could contain two different values for the same
data! However, cache coherency is guaranteed on the CRAY-T3E.
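
The stale-cache hazard can be illustrated with a toy model. The class and
function below are hypothetical and written only for this illustration; they
do not reproduce the SHMEM API or the actual cache hardware, and simply mimic
a one-sided write that bypasses the cache of the target processor.

```python
class RemotePE:
    """Toy processing element with a main memory and a small read cache."""

    def __init__(self, coherent):
        self.memory = {}
        self.cache = {}
        self.coherent = coherent  # True mimics a machine with coherent caches

    def load(self, addr):
        """Local read: served from the cache when the address is cached."""
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def flush_cache(self):
        """Discard every cached value so that the next read goes to memory."""
        self.cache.clear()


def put(target, addr, value):
    """One-sided write into the memory of `target`, without its participation."""
    target.memory[addr] = value
    if target.coherent:
        target.cache.pop(addr, None)  # the hardware invalidates the stale line
```

With `coherent=False`, a value read before the `put` keeps being returned by
`load` until `flush_cache` is called, which is the situation described for the
CRAY-T3D.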

On the other hand, the order of the data transferred by shmem_put is ensured
only by calling the shmem_fence routine. Furthermore, the shmem_put function
returns before the end of the transfer, which poses the problem of
asynchronicity. Moreover, synchronization between processors has to be
explicit and is left to the programmer, through shmem_barrier routines.


All these reasons led us to use only the shmem_get routine, instead of
shmem_put, with the intention of avoiding some of these constraints. Whereas
the shmem_put command is faster than the shmem_get one on the CRAY-T3D,
the performance of both is similar on the CRAY-T3E.

In addition to this, some other collective instructions such as
shmem_broadcast have also been used. With the aim of minimizing communication
time, another solution is to use PBLAS, a parallel library, which optimizes
many operations, such as matrix and vector operations. This is under
investigation.

3  Tests  on  the  Code  Kernel 

In a typical full simulation, most of the time is spent in FFT computations,
matrix-matrix products and matrix-vector products. To show the efficiency of
the parallel version with regard to the vector one, we compare the execution
times for the code kernel. Such a kernel is defined, on a vector computer, as
a derivative in the x-direction through an FFT followed by a derivative in the
z-direction with a matrix-matrix product. On an MPP machine, this sequence
becomes an x-derivative (with an FFT), a local matrix transposition and then a
z-derivative (with a matrix-matrix product). We applied this kind of
calculation to an arbitrary function and measured the total execution time.
Each test case has been run 1000 times and the mean value has been calculated.

We carried out a series of tests, varying the number of mesh points in both
directions on the one hand, and the number of processors on the other hand.
Figures 2 and 3 show the ratio between the mean CRAY-YMP/T90 and
CRAY-T3D/T3E times.

Fig. 2. Ratios of CPU time YMP/T3D and T90/T3E vs. the number of PEs. The
number of subdomains is equal to 5. The resolution is 50 z-points in each
subdomain and 128 x-points.


Fig. 3. Ratios of CPU time YMP/T3D and T90/T3E vs. the number of z-points.
The number of subdomains is equal to 5. The resolution is 256 x-points.


Figure 2 shows the evolution of this ratio as a function of the number of
processors. Figure 3 represents the influence of matrix sizes on the time
ratio. This ratio increases with the number of z-points. At a fixed number of
z-points, it grows with the number of x-points. Moreover, we measured
performance, in terms of GigaFlops, in each case. In this case, the physical
domain is divided into 9 subdomains with 51 Chebyshev points in the
z-direction. Figure 4 represents the performance obtained on the CRAY-T3E by
varying the number of PEs. One can remark that performance increases almost
linearly with the number of PEs up to 80 and then saturates, probably because
the number of columns in the matrix-matrix product is relatively small for
such a number of PEs. In Fig. 5, the curves show the performance, measured in
GigaFlops, versus the number of z-points (called Nz). Each curve corresponds
to a fixed number of x-points (called Nx). Performance increases with Nz, but
not regularly with Nx: it first increases and then slightly decreases, the
maximum being reached for Nx = 512. We notice that discrepancies between
Nx = 256, 512 and 1024 are very small.


Fig. 4. Performance in terms of GigaFlops on a T3E vs. the number of
processors for 9 subdomains and 51 z-points in each.


Fig. 5. Performance in terms of GigaFlops on a T3E vs. the number of
z-points. The number of x-points is 64, 128, 256, 512 and 1024 respectively.

4  The  Validation  of  the  Parallelization 

As a first validation of the whole code, we simulated the Kelvin-Helmholtz
flow. Its basic state, written in a non-dimensional form, is:

$$ u = \frac{1}{2}\tanh(2z), \quad v = 0, \quad T = 1 + \frac{\gamma - 1}{2} M^2 \left(1 - (2u)^2\right), \quad P = 1 \qquad (1) $$

with $0 < x < L_x$ and $-L_z < z < L_z$. Neumann boundary conditions are applied
to  the  horizontal  velocity  and  the  temperature  and  Dirichlet  conditions  to  the 
vertical  velocity. 


Fig.  6.  Evolution  in  time  of  the  vorticity  for  the  Kelvin-Helmholtz  flow.  The  simulation 
with  the  highest  resolution  is  slightly  different. 

The validation of the parallelization is based on the superposition of the
results of the vector and parallel versions for a Kelvin-Helmholtz instability
[1]. Indeed, we simulated, on both types of machines, the Kelvin-Helmholtz
flow with 72 x-points and 155 z-points for 3 subdomains on 52 processors. In
Fig. 6, curves represent the evolution in time of the vorticity. A simulation
with a higher resolution, with 512 x-points and 255 z-points with 5 subdomains
on 86 processors, has also been performed. This result is very similar to
those obtained with 3 subdomains and validates the parallel version.


In Table 1, we compare the Total Execution Time (TET). This time decreases
as the number of processors grows, but this variation is not linear. By
comparing the time for the whole simulation on the CRAY-YMP and on the
CRAY-T3E for the Kelvin-Helmholtz instability, we notice that this time is
three times shorter on the parallel machine. The efficiency E of a parallel
algorithm for a problem instance of size N using P processors is defined by
the formula:

$$ E(P, N) = \frac{T_1(N)}{P \, T_P(N)} \qquad (2) $$

where $T_1(N)$ and $T_P(N)$ are respectively the times needed for one
processor and P processors.
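
As a small worked illustration of this definition, the sketch below evaluates
the efficiency from two timings. The function name and the timings used in
the usage note are hypothetical; they are not measurements from this work.

```python
def parallel_efficiency(t1, tp, p):
    """Efficiency E = T_1 / (P * T_P) from serial and parallel run times."""
    if p < 1 or t1 <= 0 or tp <= 0:
        raise ValueError("p must be >= 1 and both times must be positive")
    return t1 / (p * tp)
```

For instance, a run taking 100 s on one processor and 20 s on 10 processors
has an efficiency of 100 / (10 * 20) = 0.5, i.e. half of the ideal speed-up.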

The efficiency obtained with the whole code is about 0.51 for 10 PEs. At this
time, the f90 -g compiling option was used and slowed down the execution.
This can explain the relatively poor performance. In this case, communication
time is not negligible with respect to the computational time. In addition,
the number of mesh points is too small to reach high efficiency.


T3E  10  PEs 

T3E  52  PEs 

YMP  1  PE 

T3E  1  PE 

Total  Execution  Time  in  s 

0.398  10’ 

0.183  10* 

0.474  10’ 

0.179  ur 

Time/node/cycle  (in  /is) 

148 

406 

Efficiency 

0.21 

Table 1. Total Execution Time, time per node per cycle and efficiency for
various configurations for the Kelvin-Helmholtz flow.


We also simulated the Rayleigh-Taylor flow. For these simulations, 128
x-points and 5 or 7 subdomains, with 51 z-points in each, were used.

Table 2 contains the times obtained for various configurations for the
Rayleigh-Taylor instability. For 5 subdomains (128*255), we obtained an
efficiency equal to 0.61, by using the default compiling options. Performance
is much better than for the Kelvin-Helmholtz flow. The first reason is the
higher number of z-points and processors. Because a greater number of
calculations are performed within the same time, the TET decreases. The second
reason is the optimization of the parallelization by in-lining and by reducing
the number of communications, of arrays, etc. The best time obtained on the
CRAY-T3E is shorter by a factor of 10 than on the CRAY-YMP.

Configurations:
  T3E 86 PEs,   Nx = 128, Nz = 255
  T3E 86 PEs,   Nx = 512, Nz = 255
  T3E 120 PEs,  Nx = 128, Nz = 357
  YMP 1 PE,     Nx = 128, Nz = 255
  T3E 1 PE,     Nx = 128, Nz = 255
Time/node/cycle (in µs): 2085
Efficiency: 0.61

Table 2. Total Execution Time, time per node per cycle and efficiency for
various configurations for the Rayleigh-Taylor flow.

We can expect better performance and efficiency by using other compiling
options and further improvements (for instance with the use of PBLAS).


5  Conclusion 


We have obtained some results with a parallel version of a 2D pseudo-spectral
code  using  a  dynamical  domain  decomposition  method.  It  solves  the  full  Navier- 
Stokes  equations.  The  elliptic  problems  coming  from  the  diffusive  terms  are 
solved  iteratively  in  the  Fourier  space. 

The  paradigm  used  for  the  parallelization  is  SHMEM,  because  of  its  effi¬ 
ciency,  in  comparison  with  PVM. 

The validation of the parallel version is based, for the Kelvin-Helmholtz
instability, on the good superposition of results that represent the evolution
in time of vorticity.

The best efficiency obtained to date is equal to 0.61 and the best time per
node per cycle is 39.9 µs. This performance could be improved.

This  work  is  a  successful  example  of  a  parallelized  pseudo-spectral  Fourier- 
Chebyshev  method  with  a  dynamical  domain  decomposition  technique.  We  are 
interested  in  simulation  with  very  high  resolution  and  are  applying  this  method 
to  the  Rayleigh-Taylor  instability.  This  sophisticated  numerical  method  coupled 
with  the  parallelization  will  allow  us  to  study  in  more  detail  interactions  between 
different  modes. [2] 


Abstract. In this paper we present a unified approach to the design of
different parallel block-Jacobi methods for solving the Symmetric Eigenvalue
Problem. The problem can be solved by designing a logical algorithm,
considering the matrices divided into square blocks and considering each
block as a process. Finally, the processes of the logical algorithm are mapped
onto the processors to obtain an algorithm for a particular system. Algorithms
designed in this way for ring, square mesh and triangular mesh topologies are
theoretically compared.


1  Introduction 

The Symmetric Eigenvalue Problem appears in many applications in science and
engineering, and in some cases the problems are of large dimension with high
computational cost, so it may be preferable to solve them in parallel.

Different  approaches  can  be  utilized  to  solve  the  Symmetric  Eigenvalue  Prob¬ 
lem  on  multicomputers: 

- The initial matrix can be reduced to condensed form (tridiagonal) and then
the reduced problem solved. This is the approach in ScaLAPACK [1].

-  A  Jacobi  method  can  be  used  taking  advantage  of  the  high  level  of  parallelism 
of  the  method  to  obtain  high  performance  on  multicomputers.  In  addition, 
the  design  of  block  methods  allows  us  to  reduce  the  communications  and  to 
use  the  memory  hierarchy  better.  Different  block-Jacobi  methods  have  been 
designed  to  solve  the  Symmetric  Eigenvalue  Problem  or  related  problems  on 
multicomputers  [2,  3,  4,  5]. 

* The experiments have been performed on the 512 node Paragon on the CSCC
parallel computer system operated by Caltech on behalf of the Concurrent
Supercomputing Consortium (access to this facility was provided by the PRISM
project).

Partially supported by Comision Interministerial de Ciencia y Tecnologia,
project TIC96-1062-C03-02, and Consejeria de Cultura y Educacion de Murcia,
Direccion General de Universidades, project COM-18/96 MAT.

Partially supported by Comision Interministerial de Ciencia y Tecnologia,
project TIC96-1062-G03-01.

- There are other types of methods in which high performance is obtained
because most of the computation is in matrix-matrix multiplications, which
can be optimised both in shared and distributed memory multiprocessors.
Methods of that type are those based on spectral division [6, 7, 8] or the
Yau-Lu method [9].

In this paper a unified approach to the design of parallel block-Jacobi
methods is analyzed.


2  A  sequential  block-Jacobi  method 

Jacobi methods work by constructing a matrix sequence $\{A_l\}$ by means of
$A_{l+1} = Q_l A_l Q_l^T$, $l = 1, 2, \ldots$, where $A_1 = A$, and $Q_l$ is a
plane-rotation that annihilates a pair of nondiagonal elements of matrix
$A_l$. A cyclic method works by making successive sweeps until some
convergence criterion is fulfilled. A sweep consists of successively
nullifying the n(n - 1)/2 nondiagonal elements in the lower-triangular part
of the matrix. The different ways of choosing pairs (i, j) have given rise to
different versions of the method. The odd-even order will be used, because it
simplifies a block based implementation of the sequential algorithm, and
allows parallelization. With n = 8, numbering indices from 1 to 8, and
initially grouping the indices in pairs {(1, 2), (3, 4), (5, 6), (7, 8)}, the
sets of pairs of indices are obtained as follows:

k = 1  {(1,2), (3,4), (5,6), (7,8)}
k = 2  {2, (1,4), (3,6), (5,8), 7}
k = 3  {(2,4), (1,6), (3,8), (5,7)}
k = 4  {4, (2,6), (1,8), (3,7), 5}
k = 5  {(4,6), (2,8), (1,7), (3,5)}
k = 6  {6, (4,8), (2,7), (1,5), 3}
k = 7  {(6,8), (4,7), (2,5), (1,3)}
k = 8  {8, (6,7), (4,5), (2,3), 1}
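
The sets listed above can be generated mechanically. The sketch below is one
way to do it, assuming (as the listing suggests) that the indices are kept in
a line, that odd steps pair positions (1,2), (3,4), ... and even steps pair
positions (2,3), (4,5), ..., and that the two indices of every pair exchange
positions after each step. The function name is hypothetical.

```python
def odd_even_sets(n):
    """Return the n Jacobi sets of the odd-even ordering for indices 1..n."""
    order = list(range(1, n + 1))
    sets = []
    for step in range(1, n + 1):
        start = 0 if step % 2 == 1 else 1  # odd steps start at the first position
        positions = range(start, n - 1, 2)
        sets.append([(order[i], order[i + 1]) for i in positions])
        for i in positions:                # paired indices exchange positions
            order[i], order[i + 1] = order[i + 1], order[i]
    return sets
```

For n = 8 the eight sets contain 28 pairs in total, i.e. each of the
n(n - 1)/2 index pairs exactly once, which is what a complete sweep requires.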

When the method converges we have
$D = Q_k Q_{k-1} \cdots Q_1 A Q_1^T \cdots Q_{k-1}^T Q_k^T$,
and the eigenvalues are the diagonal elements of matrix D and the eigenvectors
are the rows of the product $Q_k Q_{k-1} \cdots Q_1$.
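
To make the iteration concrete, here is a minimal, unblocked sketch of a
cyclic Jacobi method in plain Python. It is not the block algorithm of this
paper: for brevity it visits the pairs in row-cyclic order rather than in the
odd-even order, and the function name, tolerance and sweep limit are
assumptions chosen for illustration.

```python
import math

def jacobi_eigen(matrix, tol=1e-12, max_sweeps=50):
    """Cyclic Jacobi method for a symmetric matrix given as a list of lists.

    Returns (eigenvalues, v), where the rows of v are the eigenvectors.
    """
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = math.sqrt(sum(a[i][j] ** 2
                            for i in range(n) for j in range(n) if i != j))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle, |theta| <= pi/4, that annihilates a[p][q].
                d = a[q][q] - a[p][p]
                sign = 1.0 if d >= 0.0 else -1.0
                theta = 0.5 * math.atan2(2.0 * a[p][q] * sign, abs(d))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):   # rows p and q of Q A
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):   # columns p and q of (Q A) Q^T
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):   # V <- Q V accumulates the rotations
                    vpk, vqk = v[p][k], v[q][k]
                    v[p][k], v[q][k] = c * vpk - s * vqk, s * vpk + c * vqk
    return [a[i][i] for i in range(n)], v
```

Each rotation touches only rows and columns p and q, which is what makes
rotations on disjoint pairs of a Jacobi set independent of one another.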

The method works over the matrix A and a matrix V where the rotations are
accumulated. Matrix V is initially the identity matrix. To obtain an algorithm
working by blocks, both matrices A and V are divided into columns and rows of
square blocks of size s x s. These blocks are grouped to obtain bigger blocks
of size 2sk x 2sk.

The  scheme  of  an  algorithm  by  blocks  is  shown  in  figure  1. 

In each block the algorithm works by making a sweep over the elements in
the block. Blocks corresponding to the first Jacobi set are considered to have
size 2s x 2s, adding to each block the two adjacent diagonal blocks. A sweep is
performed covering all elements in these blocks and accumulating the rotations
to form a matrix Q of size 2s x 2s. Finally, the corresponding columns and rows
of blocks of size 2s x 2s of matrix A and the rows of blocks of matrix V are
updated using Q.

WHILE convergence not reached DO
    FOR every pair (i,j) of indices in a sweep DO
        perform a sweep on the block of size 2s x 2s formed
            by the blocks of size s x s, A_ii, A_ij and A_jj,
            accumulating the rotations on a matrix Q of size 2s x 2s
        update matrices A and V performing matrix-matrix
            multiplications
    ENDFOR
ENDWHILE


Fig. 1. Basic block-Jacobi iteration.

After  completing  a  set  of  blocked  rotations,  a  swap  of  column  and  row  blocks 
is  performed.  This  brings  the  next  blocks  of  size  s  x  s  to  be  zeroed  to  the 
subdiagonal,  and  the  process  continues  nullifying  elements  on  the  subdiagonal 
blocks. 

Because the sweeps over each block are performed using level-1 BLAS, and
matrices A and V can be updated using level-3 BLAS, the cost of the algorithm
is:

$$ 8k_3 n^3 + (12k_1 - 16k_3) n^2 s + 8k_3 n s^2 \ \text{flops}, \qquad (1) $$

when computing eigenvalues and eigenvectors. In this formula $k_1$ and $k_3$
represent the execution time to perform a floating point operation using
level-1 or level-3 BLAS, respectively.


3  A  logical  parallel  block-Jacobi  method 

To design a parallel algorithm, what we must do first is to decide the
distribution of data to the processors. This distribution and the movement of
data in the matrices determine the memory and data transfer requirements on a
distributed system. We begin by analyzing these requirements considering
processes but not processors, obtaining a logical parallel algorithm.

Each one of the blocks of size 2sk x 2sk is considered as a process and will
have particular memory requirements. At least it needs memory to store the
initial blocks, but some additional memory is necessary to store data in
successive steps of the execution.

A  scheme  of  the  method  is  shown  in  figure  2.  The  method  using  this  scheme 
is  briefly  explained  below. 

On each process:

WHILE convergence not reached DO
    FOR every Jacobi set in a sweep DO
        perform sweeps on the blocks of size 2s x 2s
            corresponding to indices associated to the processor,
            accumulating the rotations
        broadcast the rotation matrices
        update the part of matrices A and V associated
            to the process, performing matrix-matrix multiplications
        transfer rows and columns of blocks of A and rows
            of blocks of V
    ENDFOR
ENDWHILE


Fig.  2.  Basic  parallel  block-Jacobi  iteration. 


Each sweep is divided into a number of steps, each step corresponding to a
Jacobi set.

For each Jacobi set the rotation matrices can be computed in parallel, but
only the processes associated to blocks 2sk x 2sk of the main diagonal of A
work. On these processes a sweep is performed on each one of the blocks it
contains corresponding to the Jacobi set in use, and the rotations on each
block are accumulated on a rotation matrix of size 2s x 2s.

After the computation of the rotation matrices, they are sent to the other
processes corresponding to blocks in the same row and column in the matrix A,
and the same row in the matrix V. Then the processes can update the part of A
or V they contain.

In order to obtain the new grouping of data according to the next Jacobi set
it is necessary to perform a movement of data in the matrix, and that implies
a data transfer between processes and additional memory requirements. It
is  illustrated  in  figure  3,  where  j  =  the  rows  and  columns  of  blocks  are 
numbered from 0 to 15, and the occupation of memory on an odd and an even
step is shown. In the figure blocks containing data are marked with x.

Thus, to store a block of matrix A it is necessary to reserve a memory of size
(2sk + s) x (2sk + s), and to store a block of V a memory of size
(2sk + s) x 2sk is necessary.


4  Parallel  block-Jacobi  methods 

To obtain parallel algorithms it is necessary to assign each logical process
to a processor, and this assignment must be done in such a way that the work
is balanced. The most costly part of the algorithm is the updating of matrices
A and V (which produces the cost of order $O(n^3)$). To assign the data in a
balanced way it suffices to balance the updating of the matrices only. In the
updating of non-diagonal blocks of matrix A, 2sk x 2sk data are updated pre-
and post-multiplying by rotation matrices. In the updating of diagonal blocks
only elements in the lower triangular part of the matrix are updated pre- and
post-multiplying by rotation matrices. And in the updating of matrix V,
2sk x 2sk data are updated but only pre-multiplying by rotation matrices. So,
we can see that the volume of computation in the updating of matrices on
processes corresponding to a non-diagonal block of matrix A is twice that of
processes corresponding to a block of V or a diagonal block of A. This must be
borne in mind when designing parallel algorithms.


Fig. 3. Storage of data: a) on an odd step, b) on an even step.

An algorithm for a ring. Considering $q = n/(2sk)$ and a ring with
$p = q/2$ processors, $P_0, P_1, \ldots, P_{p-1}$, a balanced algorithm can be
obtained by assigning to each processor $P_i$ rows $i$ and $q-1-i$ of matrices
A and V. So, each processor $P_i$ contains blocks $A_{ij}$, with
$0 \le j \le i$, $A_{q-1-i,j}$, with $0 \le j \le q-1-i$, and $V_{ij}$ and
$V_{q-1-i,j}$, with $0 \le j < q$.


Fig. 4. Storage of data: a) on an odd step, b) on an even step. Algorithm for
ring topology.


To save memory and to improve computation some of the processes in the
logical method can be grouped, obtaining on each processor four logical
processes, corresponding to the four rows of blocks of matrices A and V
contained in the processor. In figure 4 the distribution of data and the
memory reserved
are  shown,  for  p  =  2  and  7  =  16.  In  that  way.  [2sk  +  s){2n  +  2sk  -p'ls)  possitioiis 
of  memory  are  reserved  on  each  processor. 
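
The balance obtained by pairing rows i and q-1-i can be checked with a simple
count. The sketch below is only illustrative: it uses the relative weights
stated earlier (an off-diagonal block of A costs twice as much to update as a
diagonal block of A or a block of V), assumes that only the lower triangular
part of A is held, so that block row i of A has i off-diagonal blocks, and
assumes q = 2p block rows. The function names are hypothetical.

```python
def block_row_work(i, q):
    """Relative update work of block row i (0-based) among q block rows."""
    a_work = 2 * i + 1   # i off-diagonal blocks (2 units each) + diagonal block
    v_work = q           # q blocks of V, 1 unit each
    return a_work + v_work

def ring_work(p):
    """Work per processor when P_i holds block rows i and q-1-i."""
    q = 2 * p
    return [block_row_work(i, q) + block_row_work(q - 1 - i, q) for i in range(p)]

def contiguous_work(p):
    """Work per processor when P_i holds block rows 2i and 2i+1 instead."""
    q = 2 * p
    return [block_row_work(2 * i, q) + block_row_work(2 * i + 1, q) for i in range(p)]
```

With p = 4, the folded assignment gives every processor 32 units of work,
whereas assigning two consecutive block rows gives 20, 28, 36 and 44 units.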

The arithmetic cost per sweep when computing eigenvalues and eigenvectors
is:

$$ 8k_3 \frac{n^3}{p} + (12k_1 - 8k_3) \frac{n^2 s}{p} + 12k_1 \frac{n s^2}{p} \ \text{flops}. \qquad (2) $$


Fig. 5. Folding the matrices in a square mesh topology.


And the cost per sweep of the communications is:


d(p  +  ^)^  +  r  ySn- +  2ns - ,  (3) 

where β represents the start-up time, and τ the time to send a double
precision number.

An algorithm for a square mesh. In a square mesh, a way to obtain a balanced
distribution of the work is to fold the matrices A and V in the system of
processors, as shown in figure 5, where a square mesh with four processors
is considered.

To  processor  Pij  are  assigned  the  next  blocks;  from  matrix  A  block  .4yp+i,j. 
if  ?■  <  l-j  block  and  if  i  >  ^p- l-j  block  A^+io^r-i-J' 

from  matrix  V'  blocks  Vi/p+i,,,  Vyp_i-i.j,  1'Vp+«,2vF-i-.? 

The memory reserved on each processor in the main antidiagonal is
(2sk + s)(14sk + 3s), and in each one of the other processors (2sk + s)(12sk + 2s).

This data distribution produces an imbalance in the computation of the rotation
matrices, because only processors in the main antidiagonal of processors
work in the sweeps over blocks in the diagonal of matrix A. On the other hand,
this imbalance allows us to overlap computations and communications.

The arithmetic cost per sweep when computing eigenvalues and eigenvectors
is:

$$8k_3\frac{n^3}{p} + (12k_1 + 2k_3)\frac{n^2 s}{\sqrt{p}} + 12k_1\frac{n s^2}{\sqrt{p}} \ \text{flops.} \qquad (4)$$

And the cost per sweep of the communications is:


Comparing equations 4 and 2 we can see the arithmetic cost is lower in the
algorithm for a ring, but only in the terms of lower order. Furthermore, communications
and computations can be overlapped in some parts of the algorithm
for a mesh.
Fig. 6. The storage of matrices in a triangular mesh topology.


An algorithm for a triangular mesh. In a triangular mesh, matrix A can
be assigned to the processors in an obvious way, and matrix V can be assigned
by folding the upper triangular part of the matrix over the lower triangular part
(figure 6).

To processor $P_{ij}$ ($i \ge j$) are assigned the blocks $A_{ij}$, $V_{ij}$ and $V_{ji}$. The memory
reserved on each processor in the main diagonal is (2sk + s)(4sk + s), and in
each one of the other processors (2sk + s)(5sk + s).

Only processors in the main diagonal work in the sweeps over blocks in the
diagonal of matrix A. In this case, as happens in a square mesh, the imbalance
allows us to overlap computations and communications.

If we call r the number of rows and columns in the system of processors, r
and p are related by the formula $r = \frac{-1 + \sqrt{1 + 8p}}{2}$, and the arithmetic cost per
sweep of the algorithm when computing eigenvalues and eigenvectors is:

$$16k_3\frac{n^3}{r^2} + (12k_1 + 2k_3)\frac{n^2 s}{r} + 12k_1\frac{n s^2}{r} \ \text{flops.} \qquad (6)$$

And  the  cost  per  sweep  of  the  communications  is: 

d{r  +  1)”  +  ■'■  + 

The value of r is a little less than $\sqrt{2p}$. Thus, the arithmetic cost of this
algorithm is worse than that of the algorithm for a square mesh, and the same
happens with the cost of communications. But when p increases the arithmetic
costs tend to be equal, and the algorithm for triangular mesh is better than the
algorithm for a square mesh, due to a more regular distribution of data.
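
As a numerical check of the statement that r is a little less than $\sqrt{2p}$, the following Python sketch assumes that a triangular mesh with r rows has p = r(r+1)/2 processors (an assumption, although it agrees with the 21-, 136- and 465-processor meshes mentioned in this paper). It also assumes that the leading arithmetic terms are proportional to 16/r^2 for the triangular mesh and 8/p for the square mesh. The function names are illustrative only.

```python
import math

def mesh_rows(p):
    """Rows r of a triangular mesh with p = r(r+1)/2 processors (assumed relation)."""
    r = (math.isqrt(1 + 8 * p) - 1) // 2
    if r * (r + 1) // 2 != p:
        raise ValueError("p is not a triangular number")
    return r

def leading_ratio(p):
    """Ratio of the assumed leading terms: (16 / r^2) / (8 / p) = 2p / r^2."""
    r = mesh_rows(p)
    return 2 * p / r**2

# Columns: p, r, sqrt(2p), ratio of leading arithmetic terms.
for p in (10, 36, 136, 465):
    print(p, mesh_rows(p), round(math.sqrt(2 * p), 2), round(leading_ratio(p), 3))
```

Under these assumptions the ratio equals 1 + 1/r, so it approaches 1 as the number of processors grows, which is the sense in which the arithmetic costs tend to be equal.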

5  Comparison 

Comparing the theoretical costs of the algorithms studied it is possible to conclude
that the algorithm for a ring is the best and the algorithm for a triangular mesh
is the worst. This can be true when using a small number of processors, but it
is just the opposite when the number of processors and the matrix size increase,
due to the overlapping of computations and communications in the algorithms
for a mesh.

Table 1. Theoretical costs per sweep of the different parts of the algorithms
(columns: computation of rotations, update of matrices, broadcast, transference
of data; rows: ring, square mesh, triangular mesh).

Some attention has been paid to the optimization of parallel Jacobi methods
by overlapping communication and computation [10, 11], and in the mesh
algorithms analysed here the imbalance in the computation of rotation matrices
makes this overlapping possible. Adding the arithmetic and the communication
costs in equations 2 and 3, 4 and 5, and 6 and 7, the total cost per sweep of the
algorithms for ring, square mesh and triangular mesh, respectively, can be estimated;
but these times have been obtained without regard to the overlapping of
computation and communication. In the algorithm for a ring there is practically
no overlapping because the computation of rotations is balanced, and after the
computation of the rotation matrices each processor is involved in the broadcast,
and the updating of matrices begins only after the broadcast finishes. But
in the algorithms for mesh the computation of rotations is performed only by
the processors in the main diagonal or antidiagonal in the system of processors,
and this makes the overlapping possible.

To compare the three methods in more detail, table 1 shows the costs per sweep
of each part of the algorithms.

The three algorithms have an isoefficiency function f(n) = p, but the algorithms
for mesh are more scalable in practice. The value of the isoefficiency
function arises from the term corresponding to rotations broadcast, which has
a cost O (n-p), but in the algorithm for a ring this is the real cost, because the
matrices A and V cannot be updated before the rotations have been broadcast.
It is different in the algorithms for a mesh topology, where the execution times
obtained are upper bounds. In these algorithms the rotations broadcast can be
overlapped with the updating of the matrices (as shown in [12] for systolic arrays)
and when the size of the matrices increases the total cost can be better
approximated by adding the costs of table 1 but without the cost of broadcast,
which  is  overlapped  with  the  updating  of  the  matrices.  In  this  way.  the  isoef- 
ficiency  function  of  the  algorithms  for  mesh  topology  is  /(n)  =  and  these 
methods are more scalable.

In addition, few processors can be utilized efficiently in the algorithm for a
ring. For example, with n = 1024, if p = 64 the block size must be lower than or equal
to 8, but when using 64 processors in the algorithm for a square mesh the block
size must be lower than or equal to 64.

The overlapping of communication and computation in the algorithm for
triangular mesh is illustrated in figure 7. In this figure matrices A and V are
shown, and a triangular mesh with 21 processors is considered. The first steps of
the computation are represented by writing into each block of the matrices which
part of the execution is carried out: R represents computation of rotations, B
broadcast of rotation matrices, U matrix updating, and D transference of data.
The numbers represent which Jacobi set is involved, and an arrow indicates a
movement of data between blocks in the matrices and the corresponding communication
of data between processors. We will briefly explain some of the aspects
in the figure:

-  Step  1:  Rotation  matrices  are  computed  by  the  processors  in  the  main  diag¬ 
onal  of  processors. 

-  Step  2:  The  broadcast  of  rotation  matrices  to  the  other  processors  in  the 
same  row  and  column  of  processors  begins. 

- Step 3: Computation and communication are overlapped. All the steps in

the  figure  have  not  the  same  cost  (the  first  step  has  a  cost  24A;i^  and  the 
second  23  +  but  if  a  large  size  of  the  matrices  is  assumed  the  cost 

of  the  computational  parts  is  much  bigger  than  that  of  the  communication 
parts,  therefore  communication  in  this  step  finishes  before  computation. 

- Step 4: More processors begin to compute and the work is more balanced.

- Step 5: Update of matrices has finished in processors in the main diagonal
and the subdiagonal of processors, and the movement of rows and columns
of blocks begins in order to obtain the data distribution needed to perform
the work corresponding to the second Jacobi set.

- Step 6: The computation of the second set of rotation matrices begins in the
diagonal before the updating of the matrices has finished. All the processors
are involved in this step, and the work is more balanced than in the previous
steps. If the cost of computation is much bigger than the cost of communication,
the broadcast could have finished and all the processors could be
computing.

- Step 7: Only four diagonals of processors can be involved at the same time
in communications. Then, if the number of processors increases the cost of
communication becomes less important.

- Step 8: First and second updating are performed at the same time.

- Step 9: After this step all the data have been moved to the positions corresponding
to the second Jacobi set.

- Step 10: When the step finishes the third set of rotations can be computed.

As we can see, there is an imbalance at the beginning of the execution, but that
is compensated by the overlapping of communication and computation, which
makes it possible to overlook the cost of broadcast to analyze the scalability of
the algorithm. The same happens with the algorithm for square mesh.

Table 2. Comparison of the algorithms for mesh topology. Execution time per sweep
(in seconds), on an Intel Paragon.

The algorithm for a triangular mesh is in practice the most scalable due
to the overlapping of computation and communication and also to the regular
distribution of data, which produces a lower overhead than the algorithm for a
square mesh. In tables 2 and 3 the two algorithms for mesh are compared. The
results have been obtained on an Intel Paragon XP/S35 with 512 processors.
Because the number of processors that can be used differs between the two algorithms,
the results have been obtained for different numbers of processors. The algorithm
for a square mesh has been executed on a physical square mesh, whereas the algorithm
for a triangular mesh has been executed not on a physical triangular mesh but
on a square mesh. This could produce a reduction in the performance of the
algorithm for triangular mesh, but this reduction does not happen.

In  table  2  the  execution  time  per  sweep  when  computing  eigenvalues  is  shown 
for  matrix  sizes  512  and  1024. 

In table 3 the Mflops per node obtained with approximately the same number
of data per processor are shown. Due to the imbalance in the parallel algorithms
the performance is low with a small number of processors, but when the number
of processors increases the imbalance is less important and the performance of
the parallel algorithms approaches that of the sequential method.

The performance of the algorithm for triangular mesh is much better when
the number of processors increases. For example, for a matrix size of 1408, in a
square mesh of 484 processors 2.99 Gflops were obtained, while in a triangular
mesh of 465 processors 4.23 Gflops were obtained.


Table 3. Comparison of the algorithms for mesh topologies. Mflops per node with
approximately the same number of data per processor, on an Intel Paragon XP/S35.


6 Conclusions

We have shown how parallel block-Jacobi algorithms can be designed in two
steps: first associating one process to each block in the matrices, and then obtaining
algorithms for a topology by grouping processes and assigning them to
processors.

Scalable algorithms have been obtained for mesh topologies, and the most
scalable in practice is the algorithm for a triangular mesh.
Abstract. We consider the development and implementation of eigensolvers
on distributed memory parallel arrays of vector processors and
show that the concomitant requirements for vectorization and parallelization
lead both to novel algorithms and novel implementation techniques.
Performance results are given for several large-scale applications and
some performance comparisons made with LAPACK and ScaLAPACK.


1  Introduction 

Eigenvalue problems (EVPs) arise ubiquitously in the numerical simulations performed
on today’s high performance computers (HPCs), and often their solution
comprises the most computationally expensive algorithmic component. It is imperative
that efficient solution techniques and high-quality software be developed
for the solution of EVPs on high performance computers.

Generally, specific attributes of an HPC architecture play a significant, if not
deterministic, role in terms of choosing/designing appropriate algorithms from
which to construct HPC software. There are many ways to differentiate today’s
HPC architectures, one of the coarsest being vector or parallel, and the respective
algorithmic priorities can differ considerably. However, on some recent HPC
architectures - those comprising a (distributed-memory) parallel array of powerful
vector processors, e.g., the Fujitsu VPP300 (see Section 5) - it is important
not to focus exclusively upon one or the other, but to strive for high levels of
both vectorization and parallelization.

In this paper we consider the solution of large-scale eigenvalue problems,

$$Au = \lambda u, \qquad (1)$$

and discuss how the concomitant requirements for vectorization and parallelization
on vector parallel processors have led both to novel implementations of
known methods and to the development of completely new algorithms. These
include techniques for both symmetric (or Hermitian) and nonsymmetric A and
for matrices with special structure, for example tridiagonal, narrow-banded, etc.
Performance results are presented for various large-scale problems solved on a
Fujitsu (Vector Parallel Processor) VPP300, and some comparisons are made
with analogous routines from the LAPACK [1] and ScaLAPACK [4] libraries.
2 Tridiagonal Eigenvalue Problems

Consider (1) with A replaced by a symmetric, irreducible, tridiagonal matrix T:

$$Tu = \lambda u, \quad \text{where } T = \mathrm{tridiag}(\beta_i, \alpha_i, \beta_{i+1}), \text{ with } \beta_i \ne 0,\ i = 2, \ldots, n. \qquad (2)$$

One of the most popular methods for computing (selected) eigenvalues of T is
bisection based on the Sturm sign count. The widely used LAPACK/ScaLAPACK
libraries, for example, each contain routines for (2) which use bisection. Since
the techniques are standard, they are not reviewed in detail here (see, e.g.,
[26]). Instead, we focus on the main computational component: evaluation of
the Sturm sign count, defined by the recursion

$$\sigma_1(\lambda) = \alpha_1 - \lambda, \qquad \sigma_i(\lambda) = (\alpha_i - \lambda) - \frac{\beta_i^2}{\sigma_{i-1}(\lambda)}, \quad i = 2, \ldots, n. \qquad (3)$$

Zeros of $\sigma_n(\lambda)$ are eigenvalues of T, and the number of times $\sigma_i(\lambda) < 0$, $i =
1, \ldots, n$, is equal to the number of eigenvalues less than the approximation $\lambda$.
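
To make the sign count concrete, here is a minimal Python sketch of the standard Sturm recurrence and of plain bisection built on it. It is an illustration under stated assumptions, not the routine used in this paper or in LAPACK: the function names, the array layout (`beta[0]` unused) and the handling of an exactly zero pivot are all choices made here.

```python
def sturm_count(alpha, beta, lam):
    """Number of eigenvalues of the symmetric tridiagonal matrix that lie below lam.

    alpha[0..n-1] is the diagonal; beta[1..n-1] is the off-diagonal (beta[0] unused).
    """
    sigma = alpha[0] - lam
    count = 1 if sigma < 0 else 0
    for i in range(1, len(alpha)):
        if sigma == 0.0:
            sigma = 1e-300  # assumption: nudge an exact zero to avoid dividing by zero
        sigma = (alpha[i] - lam) - beta[i] ** 2 / sigma
        if sigma < 0:
            count += 1
    return count

def bisect_eigenvalue(alpha, beta, k, lo, hi, tol=1e-12):
    """k-th smallest eigenvalue (k = 1, 2, ...), assumed to lie in [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sturm_count(alpha, beta, mid) >= k:
            hi = mid   # at least k eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

For the matrix tridiag(-1, 2, -1) of order 5, for instance, calling `bisect_eigenvalue([2.0] * 5, [0.0] + [-1.0] * 4, 1, 0.0, 4.0)` approximates the smallest eigenvalue, which for this matrix is known in closed form.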

Vectorization via multisection: In terms of vectorization, the pertinent observation
is that the recurrence (3) does not vectorize over the index i. However,
if it is evaluated over a sequence of m estimates, $\lambda_j$, it is trivial implementationally
to interchange the i- and j-loops and vectorize over j. This is the basic idea
behind multisection: an interval containing eigenvalues is split into more than
the two subintervals used with bisection. The hope is that the efficiency obtained
via vectorization offsets the spurious computation entailed in sign count evaluation
for subintervals containing no eigenvalues. For r > 1 eigenvalues, another
way to vectorize is to bisect r eigenvalue intervals at a time, i.e. multi-bisection.¹
On scalar processors bisection is optimal.² On vector processors, however, this
is not the case, as shown in an aptly-titled paper [37]: “Bisection is not optimal
on vector processors”.³ The non-optimality of bisection on a single VPP300
PE (“processing element”) is illustrated in Figure 1 (left), where we plot the
time  required  to  compute  one  eigenvalue  of  a  tridiagonal  matrix  (n  =  1000) 
as  a  function  of  m,  the  number  of  multisection  points  (i.e.  vector  length).  The 
tolerance is $\epsilon = 3 \times 10^{-16}$. For all plots in this section, times are averages from
25  runs. 

Clearly, the assertion in [37] holds: bisection is not optimal. In fact, multisection
using up to 3400 points is superior. The minimum time, obtained using
70 points, is roughly 17% of the bisection time, i.e. that obtained using LAPACK.
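
The loop interchange behind multisection can be sketched with NumPy, where the array operations stand in for vector instructions. This is only an illustration under assumptions: the function names, the equally spaced points and the zero-pivot safeguard are chosen here and are not taken from the VPP300 implementation.

```python
import numpy as np

def sturm_counts(alpha, beta, lams):
    """Sturm sign counts for all shifts in `lams` at once (vectorized over j)."""
    lams = np.asarray(lams, dtype=float)
    sigma = alpha[0] - lams
    counts = (sigma < 0).astype(int)
    for i in range(1, len(alpha)):                     # recurrence stays sequential in i
        sigma = np.where(sigma == 0.0, 1e-300, sigma)  # assumed safeguard against 0 pivots
        sigma = (alpha[i] - lams) - beta[i] ** 2 / sigma
        counts += sigma < 0
    return counts

def multisect_eigenvalue(alpha, beta, k, lo, hi, m=8, tol=1e-12):
    """k-th smallest eigenvalue in [lo, hi] using m interior points per step."""
    while hi - lo > tol:
        pts = np.linspace(lo, hi, m + 2)               # endpoints plus m interior points
        counts = sturm_counts(alpha, beta, pts[1:-1])
        j = int(np.searchsorted(counts, k))            # first interior point with count >= k
        lo, hi = pts[j], pts[j + 1]
    return 0.5 * (lo + hi)
```

Each step shrinks the interval by a factor m + 1 at the price of m sign counts, most of which land in subintervals without the wanted eigenvalue; whether that trade pays off depends on how well the hardware handles vectors of length m.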

¹ Nomenclaturally, multisecting r intervals might consistently be termed “multi-multisection”;
we make no distinction and refer to this also as “multisection”.

² This is probably not true for most superscalar processors: these should be able to
take advantage of the chaining/pipelining inherent in multisection.

³ This may not be true for vector processors of the near future, nor even perhaps
all of today’s, specifically those with large $n_{1/2}$, compared with, e.g., those in [37].
Additionally, many of today’s vector PEs have a separate scalar unit so it is not
justifiable to model bisection performance by assuming vector processing with vector
length one - bisection is performed by the scalar unit. See the arguments in [10].
Fig. 1. (Left) Time vs. number of multisection points $m = 1, \ldots, 256$. (Right) Optimal
(i.e., time-minimizing) m vs. accuracy $\epsilon$; dotted lines show $m_{\min}(\nu, \epsilon)$ for fixed
numbers of multisection steps $\nu = 2, \ldots, 7$ (from left to right).

Once  it  is  decided  to  use  multisection,  there  still  remains  a  critical  question: 
What  is  the  optimal  number  of  multisection  points  (equivalently,  vector  length)? 
The  answer,  which  depends  on  r,  is  discussed  in  detail  in  [9];  here  we  highlight 
only  a  few  key  observations  and  limit  discussion  to  the  case  r  =  1. 

Although not noted in [37], the optimal number of points, $m_{opt}$, depends on
the desired accuracy $\epsilon$, as illustrated in Figure 1 (right), where we plot $m_{opt}$ vs.
e  e  [10“'^,  3  X  10“^®].  Note  that  mopt  varies  between  about  50  and  250. Two 
effects  explain  this  apparently  erratic  behavior  -  one  obvious,  one  not  so. 

To reach a desired accuracy of $\epsilon$ using m multisection points requires

$$\nu = \left\lceil -\log\epsilon / \log(m+1) \right\rceil \qquad (4)$$

multisection steps. Generally, $m_{opt}(\epsilon)$ corresponds to some $m_{\min}(\nu, \epsilon)$ at which
the ceiling function in (4) causes a change in $\nu$, that is, a minimal number of
points effecting convergence to an accuracy of $\epsilon$ in $\nu$ steps. The dotted lines in
Figure 1 (right) indicate $m_{\min}(\nu, \epsilon)$ for fixed $\nu$. As $\epsilon$ is decreased (i.e., moving to
the right) $m_{\min}(\nu, \epsilon)$ increases until for some $\epsilon_{crit}(\nu)$ it entails too much spurious
computation in comparison with the (smaller) vector length, $m_{\min}(\nu+1, \epsilon)$, which
is now large “enough” to enable adequate vectorization. The optimal now follows
along the curve $m_{\min}(\nu+1, \epsilon)$ until reaching $\epsilon_{crit}(\nu+1)$, etc. This explains the
occasional transitions from one $\nu$-curve to a $(\nu+1)$-curve, but not the oscillatory
switching. This is an artifact of a specific performance anomaly associated with
the VPP300 processor, as we now elucidate.
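
The step count (4) and the quantity $m_{\min}(\nu, \epsilon)$ are easy to tabulate. The sketch below assumes that $\epsilon$ is measured relative to an initial interval of unit width, so that each step divides the width by m + 1; the function names are illustrative only.

```python
import math

def num_steps(eps, m):
    """Multisection steps nu needed to reach accuracy eps with m points, as in (4)."""
    return math.ceil(-math.log(eps) / math.log(m + 1))

def m_min(nu, eps):
    """Smallest number of points m that reaches accuracy eps in at most nu steps."""
    m = max(1, math.ceil(eps ** (-1.0 / nu)) - 1)   # closed-form estimate
    while num_steps(eps, m) > nu:                   # correct for rounding in the estimate
        m += 1
    while m > 1 and num_steps(eps, m - 1) <= nu:
        m -= 1
    return m

# Points needed for nu = 2, ..., 7 steps at a few accuracies.
for eps in (1e-6, 1e-10, 1e-14):
    print(eps, [m_min(nu, eps) for nu in range(2, 8)])
```

For a fixed number of steps, `m_min` grows as the accuracy is tightened, and allowing one more step lowers it sharply; this is the trade-off between spurious computation and vector length described above.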

In Figure 2 we extend the range of the plot on the left in Figure 1 to include
vector lengths up to 2048. The plot “wraps around” in that the bottom line
(i = 0) is for m = 1,...,256, the second (i = 1) for m = 257,...,512, etc.
Accuracy is now e = 3 x 10~^^. Note that there is a 10-20% increase in time
when the vector length increases from 64i + 8 to 64i + 9 (dotted vertical lines)
throughout the entire range of vector lengths (i = 0, ..., 31), though the effect
lessens as i increases. This anomalous behavior fosters the erraticity in Figure 1
(right). Whenever $m_{\min}(\nu, \epsilon)$ = 73, 137, 201, ... (or one or two more), the time
is decreased using $\hat{m}_{\min} = m_{\min}(\hat{\nu}; \epsilon)$ points, where $\hat{\nu} \ne \nu$ is such that $\hat{m}_{\min}$ is
not an anomalous vector length.

⁴ These values of $m_{opt}$ are roughly five to twenty times those determined in [37],
manifesting the effect of the significantly larger $n_{1/2}$ of the VPP300.
Fig. 2. Time vs. vector length $m = 1, \ldots, 2048$. Dotted lines at $m = 64i + 9$.

This  performance  anomaly  is  apparent  also  for  standard  vector  operations 
(addition,  multiplication,  etc.)  on  the  VPP300  [9];  we  do  not  believe  it  has  been 
noted  before,  and  we  are  also  currently  unable  to  explain  the  phenomenon.  The 
savings  is  of  course  only  10-20%  but  it  is  still  probably  worthwhile  to  avoid  these 
anomalous  vector  lengths,  in  general. 

When computing r > 1 eigenvalues, the optimal number of points per eigenvalue
interval, $m_{opt}$, tends to decrease as r increases, until for some $r = r_t$,
multi-bisection is preferable to multisection; this occurs when r is large enough
that multi-bisection entails a satisfactory vector length (in relation to $n_{1/2}$).
The value of $r_t$ at which this occurs depends on the desired accuracy $\epsilon$. For
more details see [9].

Parallelization - Invariant Subspaces for Clustered Eigenvalues: Eigenvectors
are calculated using the standard technique of inverse iteration (see, e.g.,
[14,40]). Letting $\lambda_i$ denote a converged eigenvalue, choose $u^0$ and iterate

$$(T - \lambda_i I)\,u^k = u^{k-1}, \qquad k = 1, 2, \ldots.$$

Generally, one step suffices. Solution of these tridiagonal linear systems is efficiently
vectorized via “wrap-around partitioning”, discussed in Section 4.
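
A minimal sketch of inverse iteration for a tridiagonal matrix follows. It is illustrative only: a dense NumPy solve stands in for the vectorized tridiagonal solver, the small perturbation of the shift and the normalization after each step are assumptions made here, and the function name and arguments are hypothetical.

```python
import numpy as np

def inverse_iteration(alpha, beta, lam, steps=1, u0=None, seed=0):
    """Approximate eigenvector of the symmetric tridiagonal matrix for eigenvalue lam.

    alpha is the diagonal; beta[1:] is the off-diagonal (beta[0] unused).
    """
    n = len(alpha)
    T = np.diag(alpha) + np.diag(beta[1:], 1) + np.diag(beta[1:], -1)
    # Assumption: perturb the shift slightly so that T - shift*I is not exactly singular.
    shift = lam + 1e-10 * max(1.0, abs(lam))
    if u0 is None:
        u0 = np.random.default_rng(seed).standard_normal(n)
    u = np.asarray(u0, dtype=float)
    for _ in range(steps):
        u = np.linalg.solve(T - shift * np.eye(n), u)  # dense stand-in for a tridiagonal solve
        u = u / np.linalg.norm(u)
    return u
```

Because the shifted matrix is nearly singular, the solution is dominated by the wanted eigenvector after a single solve, which is consistent with the remark that one step generally suffices.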

The computation of distinct eigenpairs is communication free. However, computed
invariant subspaces corresponding to multiple or tightly clustered eigenvalues
are likely to require reorthogonalization. If these eigenvalues reside on
different PEs, orthogonalization entails significant communication. Hence, clustered
eigenvalues should reside on individual PEs. This can be accomplished in
a straightforward manner if complete spectral data is available to all PEs - for
example, if the eigenvalue computation is performed redundantly on each PE,
or if all-to-all communication is initiated after a distributed computation - in
which case redistribution decisions can be made concurrently. However, we insist
on a distributed eigenvalue calculation and opine that all-to-all communication
is too expensive.

We detect clustering during the refinement process and effect the redistribution
in an implicit manner. Once subinterval widths reach a user-defined “cluster
tolerance”, the smallest and largest sign counts on each PE are communicated
to the PE which was initially allocated eigenvalues with those indices, and it is
decided which PE keeps the cluster. This PE continues refinement of the clustered
eigenvalues to the desired accuracy and computes a corresponding invariant
subspace; the other ignores the clustered eigenvalues and continues refinement
of any remaining eigenvalues. If a cluster extends across more than two PEs,
intermediate PEs are dropped from the computation. Load imbalance is likely
(and generally unavoidable) with the effect worsening as cluster sizes increase.
This approach differs from that used in the equivalent ScaLAPACK routines [5].

We  note  that  significant  progress  has  recently  been  made  in  the  computation 
of  orthogonal  eigenvectors  for  tightly  clustered  eigenvalues  [7, 27],  the  goal  being 
to  obtain  these  vectors  without  the  need  for  reorthogonalization.  These  ideas 
have  not  yet  been  used  here,  though  they  may  be  incorporated  into  later  versions 
of  our  routines;  it  is  expected  they  will  be  incorporated  into  future  releases  of 
LAPACK  and  ScaLAPACK  [6]. 

3  Symmetric  Eigenvalue  Problems 

The methods we use for symmetric EVPs entail the solution of tridiagonal EVPs,
and this is accomplished using the procedures just described; overall, the techniques
are relatively standard and are not discussed in detail.

The first is based on the usual Householder reduction to tridiagonal form,
parallelized using a panel-wrapped storage scheme [8]; there is also a version for
Hermitian matrices. For sparse matrices we use a Lanczos method [16]. Performance
depends primarily on efficient matrix-vector multiplication; the routine
uses a diagonal storage format, loop unrolling, and column-block matrix distribution
for parallelization. The tridiagonal EVPs that arise are solved redundantly
using a single-PE version of the tridiagonal eigensolver. No performance data
are presented for the Lanczos solver, but comparisons of the Householder-based
routine with those in LAPACK/ScaLAPACK are included in Section 5.


4  Nonsymmetric  Eigenvalue  Problems 

Arnoldi Methods: Arnoldi’s method [2] was originally developed to reduce a
matrix to upper Hessenberg form; its practicability as a Krylov subspace projection
method for EVPs was established in [31]. Letting $V_m = [v_1 | \cdots | v_m]$ denote
the matrix whose columns are the basis vectors (orthonormalized via, e.g.,
modified Gram-Schmidt) for the m-dimensional Krylov subspace, we obtain the
projected EVP

$$Hy = V_m^* A V_m y = \lambda y, \qquad (5)$$

where H is upper Hessenberg and of size $m \ll n$. This much smaller eigenproblem
is solved using, e.g., a QR method. Dominant eigenvalues of H approximate those
of A with the accuracy increasing with m.
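
The basic Arnoldi projection can be sketched in a few lines of NumPy. This is a textbook version for real matrices, without restarting, deflation or shift-invert, and it is not the code described in this paper; the function name and the breakdown tolerance are assumptions.

```python
import numpy as np

def arnoldi(A, v0, m):
    """m-step Arnoldi with modified Gram-Schmidt; returns V (n x m) and H (m x m)."""
    n = len(v0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):                 # modified Gram-Schmidt sweep
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:                # invariant subspace found; stop early
            return V[:, : j + 1], H[: j + 1, : j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :m], H[:m, :m]
```

The eigenvalues of the returned H (the Ritz values), obtained for example with `np.linalg.eigvals(H)`, approximate the dominant eigenvalues of A, and the only operation involving the large matrix is the product `A @ V[:, j]`.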

A plethora of modifications can be made to the basic Arnoldi method to
increase efficiency, robustness, etc. These include: restarting [31], including the
relatively recent implicit techniques [18,38]; deflation, implicit or explicit, when
computing r > 1 eigenvalues; preconditioning/acceleration techniques based on,
e.g., Chebyshev polynomials [32], least-squares [33,34], etc.; spectral transformations
for computing non-extremal eigenvalues, e.g., shift-invert [28], Cayley
[17,22], etc.; and of course block versions [35,36]. For a broad overview see [34].

Our  current  code  is  still  at  a  rudimentary  stage  of  development,  but  we  have 
incorporated  a  basic  restart  procedure,  shift-invert,  and  an  implicit  deflation 
scheme  similar  to  that  outlined  in  [34]  and  closely  related  to  that  used  in  [30]. 
Although  it  is  possible  to  avoid  most  complex  arithmetic  even  in  the  case  of  a 
complex  shift  [28],  our  routine  is  currently  restricted  to  real  shifts. 

In order to be better able to compute multiple or clustered eigenvalues we
are also developing a block version. Here matrix-vector multiplication is replaced
by matrix-matrix multiplication, leading to another potential advantage, particularly
in the context of high performance computing: it enables the use of
level-3 BLAS. On some machines this can result in block methods being preferable
even in the case of computing only a single eigenvalue [36].

Parallelization opportunities seem to be limited to the reduction phase of
the algorithm. Parallelizing, e.g., QR is possible and has been considered by
various authors (see, e.g., [3,13] and the references therein); however, since the
projected systems are generally small, it is probably not worthwhile parallelizing
their eigensolution. This is the approach taken with P_ARPACK [21], an
implementation of ARPACK [19] for distributed memory parallel architectures;
although these packages are based on the implicitly restarted Arnoldi method
[18,38], parallelization issues are, for the most part, identical to those for the
standard methods. It is probably more worthwhile to limit the maximum Hessenberg
dimension to one that is viably solved redundantly on each processor
and focus instead on increasing the efficiency of the restarting and deflation
procedures and to add some form of preconditioning/acceleration; however, the
choices for these strategies should, on vector parallel processors, be predicated on
their amenability to vectorization. As mentioned, our code is relatively nascent,
and it has not yet been parallelized, nor efficiently vectorized.

Newton-Based Methods: Let $K : \mathbb{C}^n \to \mathbb{C}^n$ and consider the eigenvalue
problem

$$K(\lambda)u = 0, \qquad K(\lambda) = (A - \lambda I). \tag{6}$$

(For generalized EVPs, $Au = \lambda Bu$, define $K(\lambda) = (A - \lambda B)$.) Reasonable
smoothness of $K(\lambda)$ is assumed but it need not be linear in $\lambda$. The basic idea
behind using Newton's method to solve EVPs is to replace (6) by the problem
of finding zeros of a nonlinear function. Embed (6) in the more general family

$$K(\lambda)u = \beta(\lambda)x, \qquad s^{*}u = \kappa. \tag{7}$$

As $\lambda$ approaches an eigenvalue, $K(\lambda)$ becomes singular, so the solution $u$ of the
first equation in (7) becomes unbounded for almost all $\beta(\lambda)x$. Hence, the second
equation - a scaling condition - can only be satisfied if $\beta(\lambda) \to 0$ as $\lambda$ approaches
an eigenvalue. The vectors $s$ and $x$ can be chosen dynamically as the iteration
proceeds; this freedom results in the possibility of exceeding the second-order
convergence rate characteristic of Newton-based procedures [25].

Differentiating equations (7) with respect to $\lambda$ gives

$$K\frac{du}{d\lambda} + \frac{dK}{d\lambda}u = \frac{d\beta}{d\lambda}x, \qquad s^{*}\frac{du}{d\lambda} = 0.$$

Solving for the Newton correction, the Newton iteration takes the form

$$\lambda \leftarrow \lambda - \Delta\lambda, \qquad \Delta\lambda = \frac{\beta(\lambda)}{d\beta/d\lambda} = \frac{s^{*}u}{s^{*}K^{-1}\frac{dK}{d\lambda}u}.$$

Note that for the non-generalized problem (1), $dK/d\lambda = -I$. The main
computational component is essentially inverse iteration with the matrix $K$; this is
effected using a linear solver highly tuned for the VPP300 [23].
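
The update can be sketched for the standard problem, where $dK/d\lambda = -I$, so that the correction reduces to $\Delta\lambda = -(s^{*}u)/(s^{*}K^{-1}u)$. The following Python fragment is a minimal illustration only: the small dense solver, the fixed vectors `s` and `x`, the tolerance and the 2 x 2 test matrix are assumptions made for the example, not the tuned VPP300 solver or the dynamic choice of $s$ and $x$ described above.

```python
def solve(M, b):
    """Solve M y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n + 1):
                a[r][k] -= f * a[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (a[r][n] - tail) / a[r][r]
    return y


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def newton_eigenvalue(A, lam, s, x, tol=1e-12, max_iter=50):
    """Newton iteration on beta(lambda) for K(lambda) = A - lambda*I."""
    n = len(A)
    for _ in range(max_iter):
        K = [[A[i][j] - (lam if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        try:
            u = solve(K, x)   # K u = x; the scale beta cancels in the ratio
            w = solve(K, u)   # w = K^{-1} u
            # dK/dlambda = -I, hence delta = s*u / (s* K^{-1} (-I) u)
            delta = -dot(s, u) / dot(s, w)
        except ZeroDivisionError:
            break             # K is numerically singular: lam has converged
        lam -= delta
        if abs(delta) <= tol * max(1.0, abs(lam)):
            break
    return lam


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]          # eigenvalues 1 and 3
    print(newton_eigenvalue(A, 2.6, [1.0, 0.0], [1.0, 0.0]))
```

Each step costs two solves with the same matrix $K$, which is why a single factorization per iteration dominates the work in practice.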

Convergence  rates,  including  conditions  under  which  third-order  convergence 
is  possible,  are  discussed  in  [24].  A  much  more  recent  reference  is  [25]  in  which 
the  development  is  completely  in  terms  of  generalized  EVPs. 

Deflation for k converged eigenvalues can be effected by replacing $\beta(\lambda)$ with

However, it is likely to be more beneficial to use as much of the existing spectral
information as possible (i.e., not only the eigenvalues). Wielandt deflation (see,
e.g., [34]) requires knowledge of left and right eigenvectors, hence involves matrix
transposition which is highly inefficient on distributed memory architectures.
Hence, we opt for a form of Schur-Wielandt deflation; see [25] for details.

A  separate  version  of  the  routines  for  the  Newton-based  procedures  has  been 
developed  specifically  for  block  bidiagonal  matrices.  This  algorithm  exhibits  an 
impressive  convergence  rate  of  3.56,  and  uses  a  multiplicative  form  of  Wielandt 
deflation  so  as  to  preserve  matrix  structure.  Implementationally,  inversion  of 
block  bidiagonal  matrices  is  required,  and  this  is  efficiently  vectorized  by  the 
technique  of  wrap-around  partitioning,  which  we  now  describe. 

Vectorization - Wrap-Around Partitioning for Banded Systems: Wrap-around
partitioning [12] is a technique enabling vectorization of the elimination
process  in  the  solution  of  systems  of  linear  equations.  Unknowns  are  reordered 
into  q  blocks  of  p  unknowns  each,  thereby  highlighting  groups  of  unknowns  which 
can  be  eliminated  independently  of  one  another.  The  natural  formulation  is  for 
matrices  with  block  bidiagonal  (BBD)  structure,  shown  below  on  the  left,  but 
the  technique  is  also  applicable  to  narrow-banded  matrices  of  sufficiently  regular 
structure, as illustrated by reordering a tridiagonal matrix to have BBD form.

Significant speed-ups - of roughly a factor of 20 over scalar speed - are easily
attainable for matrices of size n > 1000 in the case that the subblock dimension,
m, is small (m = 2 for tridiagonal matrices). The case q = 2 corresponds to cyclic
reduction, but with wrap-around partitioning p and q need not be exact factors of
n; this means that stride-two memory access, which may result in bank conflicts
on some computers (e.g., the VPP300), can be avoided. However, q should be
relatively small; stride-3 has proven effective on the VPP300. Orthogonal
factorization is used rather than Gaussian elimination with partial pivoting since the
latter is known to have less favorable stability properties for BBD matrices [41].
Stable factorization preserves BBD structure so that wrap-around partitioning
can be applied recursively.

An example is illustrative. Consider a BBD matrix of size n = 11. With
inexact factors p = 4 and q = 2 the reordered matrix has the form shown
below on the left. In the first stage the blocks B_i, i = 2, 4, 6, 8 are eliminated;
since each is independent of the others this elimination can be vectorized. Fill-in
occurs whether orthogonal factorization or Gaussian elimination with pivoting
is used; this is shown at the right, where blocks remaining nonzero are indicated
by *, deleted blocks by 0, and blocks becoming nonzero by a 1 indicating fill-in
occurred in that position during the first stage.
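
The index bookkeeping of this first stage can be sketched in a few lines of Python. This is only an illustration of which blocks are selected, under the assumption (inferred from the n = 11, p = 4, q = 2 example, not stated as a general rule in the text) that the eliminated blocks are the first p indices taken at stride q; the function name is hypothetical and no factorization is performed.

```python
def first_stage_split(n, p, q):
    """Return (eliminated, remaining) 1-based block indices for one stage.

    Assumption: the eliminated blocks are q, 2q, ..., pq (those not exceeding n).
    """
    eliminated = [q * j for j in range(1, p + 1) if q * j <= n]
    chosen = set(eliminated)
    remaining = [i for i in range(1, n + 1) if i not in chosen]
    return eliminated, remaining


if __name__ == "__main__":
    eliminated, remaining = first_stage_split(11, 4, 2)
    print("eliminated:", eliminated)
    print("remaining: ", remaining)
```

Because no two selected indices are adjacent when q is at least 2, the selected blocks do not couple to one another in a bidiagonal structure, which is what allows their elimination to proceed as one vector operation.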
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The  potential  for  recursive  application  of  the  reordering  is  evinced  by  noting 
that  the  trailing  block  2x2  submatrix  -  the  one  now  requiring  elimination  -  is 
again  block  bidiagonal.  Recursion  terminates  when  the  final  submatrix  is  small 
enough to be solved sequentially.

Arnoldi-Newton Methods: Although the Newton-based procedures ultimately
result in convergence rates of up to 3.56, they suffer when good initial data are
unavailable; unfortunately this is often the case when dealing with large-scale
EVPs. Conversely, Arnoldi methods seem nearly always to move in the right
direction at the outset, but may stall or break down as the iteration continues,
for instance if the maximal Krylov dimension is chosen too small. Heuristics are
required to develop efficient, robust Arnoldi eigensolvers. In a sense then, these
methods may be viewed as having orthogonal difficulties: Newton methods suffer
at the outset, but ultimately perform very well, and Arnoldi methods start off
well but perhaps run into difficulties as the iteration proceeds. For this reason we
have considered a composition of the two methods: the Arnoldi method is used
to get good initial estimates of the eigenvalues of interest and their corresponding
Schur vectors for use with the Newton-based procedures. These methods are
in the early stages of development.
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5  Performance  Results  on  Applications 


We now investigate the performance of some of our routines on eigenvalue
problems arising in a variety of applications and make some comparisons with
corresponding routines from LAPACK and ScaLAPACK. More details and extended
performance results on these applications can be found in [11].

Fujitsu  VPP300:  All  performance  experiments  were  performed  on  the  thirteen 
processor  Fujitsu  VPP300  located  at  the  Australian  National  University.  The 
Fujitsu  VPP300  is  a  distributed-memory  parallel  array  of  powerful  vector  PEs, 
each  with  a  peak  rate  of  2.2  Gflops  and  512  MB  or  1.92  GB  of  memory  -  the 
ANU  configuration  has  a  peak  rate  of  about  29  Gflops  and  roughly  14  GB  of 
memory.  The  PEs  consist  of  a  scalar  unit  and  a  vector  unit  with  one  each  of 
load,  store,  add,  multiply  and  divide  pipes.  The  network  is  connected  via  a  full 
crossbar  switch  so  all  processors  are  equidistant;  peak  bandwidth  is  570MB/s 
bi-directional.  Single-processor  routines  are  written  in  Fortran  90  and  parallel 
routines  in  VPP  Fortran,  a  Fortran  90-based  language  with  compiler  directives 
for  data  layout  specification,  communication,  etc. 

Tridiagonal EVP - Molecular Dynamics: First we consider an application
arising in molecular dynamics. The multidimensional Schrodinger equation
describing the motion of polyatomic fragments cannot be solved analytically. In
numerical simulations it is typically necessary to use many hundreds of thousands
of basis functions to obtain accurate models of interesting reaction processes;
hence, the construction and subsequent diagonalization of a full Hamiltonian
matrix, H, is not viable. One frequently adopted approach is to use a Lanczos
method, but the Krylov dimension - the size of the resulting tridiagonal
matrix, T - often exceeds the size of H. Thus, computation of the eigenvalues of T
becomes the dominant computational component.

To test the performance of the tridiagonal eigensolver described in Section 2
we compute some eigenvalues and eigenvectors for a tridiagonal matrix of order
n = 620,000 arising in a molecular dynamics calculation [29]. In Table 1 we
present  results  obtained  using  our  routines  and  the  corresponding  ones  from 
LAPACK  on  a  single  VPP300  processor  for  the  computation  of  one  eigenpair  and 
100  eigenpairs  of  the  11411  which  are  of  interest  for  this  particular  matrix.  For 
further  comparison  we  also  include  times  for  the  complete  eigendecomposition 
of  matrices  of  size  1000,  3000,  and  5000.  Values  are  computed  to  full  machine 
precision. 


# λ       1       100     1000            3000            5000
SSL2VP    23.06   162.7   3.208           23.90           63.43
LAPACK    25.11   768.1   7.783 (25.46)   132.4 (395.5)   581.1 (1717.)

Table 1. Tridiagonal eigensolver: Molecular dynamics application.




For large numbers of eigenvalues the optimal form of multisection is
multibisection so that there are no significant performance differences between the
two  eigenvalue  routines.  However,  the  eigenvector  routine  using  wrap-around 
partitioning  is  significantly  faster  than  the  LAPACK  implementation,  resulting 
in  considerably  reduced  times  when  large  numbers  of  eigenvectors  are  required. 
We  note  that  the  LAPACK  routine  uses  a  tridiagonal  QR  method  when  all 
eigenvalues  are  requested.  If  r  <  n  -  1  are  requested,  bisection  with  inverse 
iteration  is  used;  the  parenthetical  times  -  a  factor  of  three  larger  -  are  those 
obtained  computing  n  -  1  eigenpairs  and  serve  further  to  illustrate  the  efficiency 
of  our  implementation.  Effective  parallelization  is  evident  from  the  performance 
results  of  the  symmetric  eigensolver,  which  we  next  address. 

Symmetric EVP - Quantum Chemistry: An application arising in
computational quantum chemistry is that of modelling electron interaction in protein
molecules. The eigenvalue problem again arises from Schrodinger's equation,
$H\Psi = E\Psi$, where $H$ is the Hamiltonian operator, $E$ is the total energy of the
system, and $\Psi$ is the wavefunction. Using a semi-empirical, as opposed to ab
initio, formulation we arrive at an eigenvalue problem,

$$F\psi = \epsilon\psi,$$

where $\epsilon$ is the energy of an electron orbital and $\psi$ the corresponding
wavefunction. The software package MNDO94 [39] is used to generate matrices for
three protein structures: (1) single helix of pheromone protein from Euplotes
raikovi (573 atoms, n = 1482), (2) third domain of turkey ovomucoid inhibitor
(814 atoms, n = 2068), and (3) bovine pancreatic ribonuclease A (1856 atoms,
n = 4709). In Table 2 we give the times required to compute all eigenpairs of the
resulting symmetric EVPs using our routines based on Householder reduction
and the tridiagonal eigensolver. Also given are times obtained using LAPACK
and ScaLAPACK. As noted above, when all eigenvalues are requested these
routines use QR on the resulting tridiagonal system and this is faster than their
routines based on bisection and inverse iteration. Thus, if fewer than n eigenvalues
are requested the performance differences between those routines and ours
are amplified considerably.


[Table 2 body is largely illegible in the source; the legible entries are listed in reading order, without column alignment.]
n:            1482, 2068, 4709
SSL2VP(P):    4.130, 35.55, 24.76, 12.69, 373.1, 217.5
(Sca)LAPACK:  6.437

Table 2. Symmetric eigensolver: Quantum chemistry application.


The reduction and eigenvector recovery algorithmic components are
apparently more efficiently implemented in ScaLAPACK, but the efficiency of our
tridiagonal eigensolvers results in superior performance for the SSL2VP(P)
routines. Good scalability of our parallel implementation is also evident.

Nonsymmetric EVP - Optical Physics: Now we consider an application
from optical physics, namely the design of dielectric waveguides, e.g., optical
fibers. We solve the vector wave equation for general dielectric waveguides which,
for the magnetic field components, takes the form of two coupled PDEs

$$\frac{\partial^2 H_x}{\partial x^2} + \frac{\partial^2 H_x}{\partial y^2} - 2\frac{\partial \ln(n)}{\partial y}\left(\frac{\partial H_x}{\partial y} - \frac{\partial H_y}{\partial x}\right) + (n^2k^2 - \beta^2)H_x = 0,$$

$$\frac{\partial^2 H_y}{\partial x^2} + \frac{\partial^2 H_y}{\partial y^2} - 2\frac{\partial \ln(n)}{\partial x}\left(\frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y}\right) + (n^2k^2 - \beta^2)H_y = 0.$$

Following [20], a Galerkin procedure is applied to this system, resulting in a
coupled system of algebraic equations. Given an optical fiber with indices of
refraction $n_0$ and $n_1$ for the cladding and core regions, respectively, eigenvalues
of interest correspond to "guided modes" and are given by $\lambda = (\beta/k)^2$, where
$\beta/k \in [n_0, n_1)$. The matrix is full, real, and nonsymmetric; despite the lack of
symmetry all eigenvalues are real.

Since the Arnoldi-based procedures have not yet been efficiently
vectorized/parallelized we do not compare performance against, e.g., ARPACK or
P_ARPACK. Instead we illustrate the considerable reduction in time obtained
using the Arnoldi method to acquire good initial data for the Newton-based
procedures - that is, the Arnoldi-Newton method. This comparison is somewhat
contrived since the Newton codes use full complex arithmetic and the matrix for
this problem is real. However, it serves to elucidate the effectiveness of combining
the two procedures. Considering a fiber with indices of refraction $n_0 = 1.265$
and $n_1 = 1.415$, we use shift-invert Arnoldi (in real arithmetic) with a shift of
$\sigma = 1.79 \in [1.265^2, 1.415^2)$. Once Ritz estimates are O(1 x 10“®), eigenvalues
and Schur vectors are passed to the Newton-based routine. Generally, only one
Newton step is then necessary to reach machine precision. Times for complex
LU factorization are shown, and the number of factorizations required with the
Newton and Arnoldi-Newton methods appears in parentheses.


n       cmplx LU   Newton         Arnoldi-Newton
1250    4.105      1414. (108)    328.4 (26)
3200    62.92      20904 (103)    3597. (31)

Table 3. Nonsymmetric eigensolver: Optical physics application.
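
The size of the reduction can be read off Table 3 with a little arithmetic. The following Python fragment is a small worked example using only the tabulated times and factorization counts; the dictionary layout and function name are illustrative choices.

```python
# Times and factorization counts transcribed from Table 3.
table3 = {
    1250: {"newton": (1414.0, 108), "arnoldi_newton": (328.4, 26)},
    3200: {"newton": (20904.0, 103), "arnoldi_newton": (3597.0, 31)},
}


def summarize(row):
    """Return the time ratio and the number of factorizations avoided."""
    t_newton, f_newton = row["newton"]
    t_combined, f_combined = row["arnoldi_newton"]
    return {
        "speedup": t_newton / t_combined,
        "factorizations_saved": f_newton - f_combined,
    }


if __name__ == "__main__":
    for n, row in table3.items():
        s = summarize(row)
        print(n, round(s["speedup"], 1), s["factorizations_saved"])
```

For both matrix sizes the combined method needs roughly a quarter to a third of the factorizations, and the total time falls by a factor of about four to six.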


Matrices are built using the (C++) software library NPL [15] which also
includes an eigensolver; our routine consistently finds eigenvalues which NPL's
fails to locate - in this example, NPL located ten eigenvalues of interest while our
routines find fourteen. The techniques used here are robust and effective. Clearly,
the use of Arnoldi's method to obtain initial eigendata results in a significant
reduction in time. More development work is needed for these methods and for
their implementations, in particular the (block) Arnoldi routine which is not
nearly fully optimized.


Complex Nonsymmetric Block Bidiagonal EVP - CFD: Our final
application is from hydrodynamic stability; we consider flow between infinite parallel
plates. The initial behavior of a small disturbance to the flow is modelled by the
Orr-Sommerfeld equation

$$\frac{i}{\alpha R}\left(\frac{d^2}{dz^2} - \alpha^2\right)^2\phi + (U(z) - \lambda)\left(\frac{d^2}{dz^2} - \alpha^2\right)\phi - \frac{d^2U(z)}{dz^2}\phi = 0,$$


where $\alpha$ is the wave number, $R$ is the Reynolds number, and $U(z)$ is the velocity
profile of the basic flow; we assume plane Poiseuille flow for which $U(z) = 1 - z^2$.
Boundary conditions are $\phi = d\phi/dz = 0$ at solid boundaries and, for boundary
layer flows, ^ ~ 1 as $z \to \infty$. The differential equation is written as a system
of four first-order equations which, when integrated using the trapezoidal rule,
yields a generalized EVP

$$K(\alpha, R)u = 0, \qquad K(\alpha, R) = A(\alpha, R) - \lambda B,$$

where $A(\alpha, R)$ is complex and block bidiagonal with $4 \times 4$ blocks. Further
description of the problem formulation can be found in [11].

We compute the neutral curve, i.e. the locus of points in the $(\alpha, R)$-plane for
which $\mathrm{Im}\{c(\alpha, R)\} = 0$, using the Newton-based procedures of the last section.
The resulting algorithm has convergence rate 3.56 [25]. We use a grid with 5000
points for which A is of order n = 20000. A portion of the neutral curve is
plotted in Figure 3.


Fig. 3. Complex banded nonsymmetric eigensolver: Neutral stability diagram
for Poiseuille flow.

The significant degree of vectorization obtained using wrap-around
partitioning enables us to consider highly refined discretizations. Additional results,
including consideration of a Blasius velocity profile, can be found in [11].

Abstract. In recent times the work on networks of processors has become
very important, due to the low cost and the availability of these
systems. This is why it is interesting to study algorithms on networks
of processors. In this paper we study different Eigenvalue Solvers on
networks of processors. In particular, the Power method, deflation, the Givens
algorithm, Davidson methods and Jacobi methods are analysed using PVM
and MPI. The conclusion is that the solution of Eigenvalue Problems can
be accelerated by using networks of processors and typical parallel
algorithms, but the high cost of communications in these systems makes
small modifications to the algorithms necessary to achieve good performance.


1  Introduction 

Within  the  different  platforms  that  can  be  used  to  develop  parallel  algorithms, 
in  recent  years  special  attention  is  being  paid  to  networks  of  processors.  The 
main  reasons  for  their  use  are  the  lesser  cost  of  the  connected  equipment,  the 
greater  availability  and  the  additional  utility  as  a  usual  means  of  work.  On  the 
other  hand,  communications,  the  heterogeneity  of  the  equipment,  their  shared 
use  and,  generally,  the  small  number  of  processors  used  are  the  negative  factors. 

The biggest difference between multicomputers and networks of processors
is the high cost of communications in the networks, due to the small bandwidth
and the shared bus which allows us to send only one message at a time. This
characteristic of networks makes it a difficult task to obtain acceptable
efficiencies, and also suggests designing algorithms with good performance
on a small number of processors, rather than designing scalable algorithms.
So, despite not being as efficient as supercomputers, networks of processors
emerge as a new environment for the development of parallel algorithms with
a good cost/efficiency ratio. Some of the problems that the networks have can
be overcome using faster networks, better algorithms and new environments
appropriate to the network features.

* Partially supported by Comisión Interministerial de Ciencia y Tecnología, project
TIC96-1062-C03-02, and Consejería de Cultura y Educación de Murcia, Dirección
General de Universidades, project GOM-18/96 MAT.

The most used matrix libraries (BLAS, LAPACK [1], ScaLAPACK [2])
are not implemented for those environments, and it seems useful to programme
these linear algebra libraries over networks of processors [3, 4, 5, 6, 7]. The
implementation of these algorithms can be done over programming environments
like PVM [8] or MPI [9], which makes the work easier, although they do not
take advantage of all the power of the equipment. The communications libraries
utilized have been the free access libraries PVM version 3.4 and MPICH [10]
(which is an implementation of MPI) version 1.0.11.

The results obtained using other systems or libraries could be different, but
we are interested in the general behaviour of networks of processors, and more
particularly Local Area Networks (LANs), when solving Linear Algebra
Problems. Our intention is to design a library of parallel linear algebra routines for
LANs (we could call this library LANLAPACK), and we are working on Linear
System Solvers [11] and Eigenvalue Solvers. In this paper some preliminary
studies of Eigenvalue Solvers are shown. These problems are of great interest
in different fields in science and engineering, and it is possibly better to solve
them with parallel programming due to the high cost of computation [12]. The
Eigenvalue Problem is still open in parallel computing, where it is necessary
to compute the eigenvalues efficiently and accurately. To that end, we have carried out
a study on the development of five methods to calculate eigenvalues over two
different environments, PVM and MPI, using networks with both Ethernet
and Fast-Ethernet connections. Experiments have been performed on four
different systems:

-  A network of 5 SUN Ultra 1 140 with Ethernet connections and 32 Mb of
memory on each processor.

-  A  network  of  7  SUN  Sparcstation  with  Ethernet  connections  and  8  Mb  of 
memory  on  each  processor. 

-  A  network  of  12  PC  486,  with  Ethernet  connections,  a  memory  of  8  Mb  on 
each  processor,  and  using  Linux. 

-  A  network  of  6  Pentiums,  with  Fast-Ethernet  connections,  a  memory  of  32 
Mb  on  each  processor,  and  using  Linux. 

In this paper we will call these systems SUNUltra, SUNSparc, PC486 and
Pentium, respectively.

The approximate cost of floating point operations working with double
precision numbers and the theoretical cost of communicating a double precision
number, in the four systems, are shown in table 1. The arithmetic cost has been
obtained with medium sized matrices (stored in main memory). The quotient
of the communication cost with respect to the arithmetic cost is also shown.
These approximations are presented to show the main characteristics of the four
systems, but they will not be used to predict execution times because in
networks of processors many factors, which are difficult to measure, influence the
execution time: collisions when accessing the bus, other users in the system,
assignment of processes to processors, etc. The four systems also have different
utilization characteristics: SUNUltra can be isolated to obtain results, but the
other three are shared and it is more difficult to obtain representative results.

Table 1. Comparison of arithmetic and communication costs in the four systems
utilized.

            floating point cost   word-sending time   quotient
SUNUltra    0.025 μs              0.8 μs              32
SUNSparc    0.35 μs               0.8 μs              2.28
PC486       0.17 μs               0.8 μs              4.7
Pentium     0.062 μs              0.08 μs             1.29

given v_0
FOR i = 1, 2, ...
    r_i = A v_{i-1}
    β_i = ||r_i||_∞
    v_i = r_i / β_i
ENDFOR


Fig. 1. Scheme of the sequential Power method.
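
The scheme of figure 1 can be sketched directly in Python. This is a minimal illustration, not the authors' implementation: the stopping test on successive values of β, the tolerance, the iteration limit and the 2 x 2 example matrix are assumptions added for the example.

```python
def power_method(A, v0, tol=1e-12, max_iter=1000):
    """Sequential Power method following the scheme of figure 1.

    Returns (beta, v): an estimate of the absolute value of the dominant
    eigenvalue and an associated eigenvector scaled to infinity norm 1.
    """
    v = list(v0)
    beta = 0.0
    for _ in range(max_iter):
        # r_i = A v_{i-1}
        r = [sum(a * x for a, x in zip(row, v)) for row in A]
        # beta_i = ||r_i||_inf
        beta_new = max(abs(c) for c in r)
        if beta_new == 0.0:
            raise ValueError("A v is zero; choose another starting vector")
        # v_i = r_i / beta_i
        v = [c / beta_new for c in r]
        # Assumed stopping test: relative change of beta below tol.
        if abs(beta_new - beta) <= tol * beta_new:
            return beta_new, v
        beta = beta_new
    return beta, v


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]      # eigenvalues 1 and 3
    beta, v = power_method(A, [1.0, 0.0])
    print(beta, v)
```

The matrix-vector product is the only O(n^2) step per iteration; the norm and the scaling are O(n), which is why the parallel version concentrates on the product.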


2  Eigenvalue  Solvers 

Methods of partial resolution (the calculation of some eigenvalues and/or
eigenvectors) of Eigenvalue Problems are studied: the Power method, the deflation
technique, the Givens algorithm and the Davidson method; and also the Jacobi method
to compute the complete spectrum. We are interested in the parallelization of
the methods on networks of processors. Mathematical details can be found in
many books ([13, 14, 15, 16]).

Power method. The Power method is a very simple method to compute the
eigenvalue of biggest absolute value and the associated eigenvector. Some
variations of the method allow us to compute the eigenvalue of lowest absolute value
or the eigenvalue nearest to a given number. This method is too slow to be
considered a good method in general, but in some cases it can be useful. In spite
of the slow convergence of the method, it is very simple and will allow us to begin
to analyse Eigenvalue Solvers on networks of processors.

A scheme of the algorithm is shown in figure 1. The algorithm works by
generating a sequence of vectors $v_i$ converging to an eigenvector $q_1$ associated
with the eigenvalue $\lambda_1$, as well as a sequence of values $\beta_i$ converging to the
eigenvalue $\lambda_1$. The speed of convergence is governed by the ratio $|\lambda_2/\lambda_1|$
of the two eigenvalues of biggest absolute value.

Each iteration in the algorithm has three parts. The most expensive is the
matrix-vector multiplication, and in order to parallelize the method attention
must be concentrated on that operation. In the parallel implementation, a
master-slave scheme has been adopted. Matrix A is considered distributed
between the slave processes in a block striped partition by rows [17]. A possible
scheme of the parallel algorithm is shown in figure 2. The matrix-vector
multiplication is performed by the slave processes, but the master obtains β_i and forms
v_i. These two operations of cost O(n) could be done in parallel in the slaves, but
this would generate more communications, which are very expensive operations in
networks of processors.

master:
    given v_0
    FOR i = 1, 2, ...
        broadcast v_{i-1} to the slaves
        receive the blocks of r_i from the slaves, and form r_i
        β_i = ||r_i||_∞
        compute norm
        broadcast norm to the slaves
        IF convergence not reached
            v_i = r_i / β_i
        ENDIF
    ENDFOR

slave k, with k = 0, 1, ..., p - 1:
    FOR i = 1, 2, ...
        receive v_{i-1} from master
        compute the local block of r_i = A v_{i-1}
        send the local block of r_i to master
        receive norm from master
    ENDFOR


Fig. 2. Scheme of the parallel Power method.
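
The block striped partition by rows can be illustrated without any message passing. In the following Python sketch the "slaves" are simulated by ordinary function calls, so it shows only how the rows are divided and how the master assembles the result; the function names and the choice of contiguous blocks whose sizes differ by at most one are assumptions for the example, and no PVM or MPI call is involved.

```python
def row_blocks(n, p):
    """Split row indices 0..n-1 into p contiguous blocks of near-equal size."""
    base, extra = divmod(n, p)
    blocks, start = [], 0
    for k in range(p):
        size = base + (1 if k < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def slave_product(A, rows, v):
    """Work of one slave: the rows of A it owns times the full vector v."""
    return [sum(a * x for a, x in zip(A[i], v)) for i in rows]


def master_matvec(A, v, p):
    """Master: hand v to each (simulated) slave and concatenate the blocks."""
    r = []
    for rows in row_blocks(len(A), p):
        r.extend(slave_product(A, rows, v))
    return r


if __name__ == "__main__":
    A = [[i + j for j in range(5)] for i in range(5)]
    print(master_matvec(A, [1, 2, 3, 4, 5], p=2))
```

Each slave needs the whole vector but returns only its own block, which is why the vector is broadcast and the results are gathered at the master.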

The arithmetic cost of the parallel method is: — 1- iV/i flops, and the
theoretical efficiency is 100%.

The cost of communications varies with the way in which they are performed.
When the distribution of vector $v_i$ and the norm is performed using a broadcast
operation, the cost of communications per iteration is $2\tau_b(p) + \beta_b(p)(n + 1) + \tau + \beta n$,
where $\tau$ and $\beta$ represent the start-up and the word-sending time,
respectively, and $\tau_b(p)$ and $\beta_b(p)$ the start-up and the word-sending time when using
a broadcast operation on a system with p slaves. If the broadcasts are replaced
by point to point communications the cost of communications per iteration is
$\tau(2p + 1) + \beta(pn + n + p)$. Which method of communication is preferable depends
on the characteristics of the environment (the communication library and the
network of processors) we are using.
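
The two per-iteration communication costs can be compared numerically. The following Python sketch evaluates them, reading the expressions in the paragraph above as $2\tau_b(p) + \beta_b(p)(n+1) + \tau + \beta n$ for the broadcast variant and $\tau(2p+1) + \beta(pn+n+p)$ for the point to point variant. All numerical values in the example are hypothetical placeholders chosen for easy arithmetic, not measurements from the four systems.

```python
def broadcast_cost(n, tau, beta, tau_b, beta_b):
    """Per-iteration cost when the vector and the norm are broadcast.

    tau_b and beta_b are the broadcast start-up and word-sending times
    already evaluated for the number of slaves in use.
    """
    return 2 * tau_b + beta_b * (n + 1) + tau + beta * n


def point_to_point_cost(n, p, tau, beta):
    """Per-iteration cost when broadcasts are replaced by p separate sends."""
    return tau * (2 * p + 1) + beta * (p * n + n + p)


if __name__ == "__main__":
    # Hypothetical parameter values, in arbitrary time units.
    n, p = 100, 4
    tau, beta = 1.0, 0.5
    tau_b, beta_b = 2.0, 1.0
    print("broadcast:     ", broadcast_cost(n, tau, beta, tau_b, beta_b))
    print("point to point:", point_to_point_cost(n, p, tau, beta))
```

With these placeholder values the broadcast variant is cheaper, but the outcome reverses if the broadcast word-sending time grows with p as fast as p separate sends would, which is the dependence on the environment noted in the text.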
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The  parallel  algorithm  is  theoretically  optimum  if  we  only  consider  efficiency, 
but when studying scalability, the isoefficiency function is n a and the
scalability is not very good. This bad scalability is caused by the use of a shared bus
which prevents data from being sent at the same time. Also the study of scalability is not
useful in this type of system due to the reduced number of processors.

We will experimentally analyse the algorithm in the most and the least
adequate systems for parallel processing (Pentium and SUNUltra, respectively).
In table 2 the execution time of the sequential and the parallel algorithms on
the two systems is shown, varying the number of slaves and the matrix size.
Times have been obtained for random matrices, and using PVM and the routine
pvm_mcast to perform the broadcast in figure 2. The results greatly differ in the
two systems due to the big difference in the proportional cost of arithmetic and
communication operations. Some conclusions can be obtained:

-  Comparing the execution time of the sequential and the parallel algorithms
using one slave, we can see the very high penalty of communications in these
systems, especially in SUNUltra.

-  The best execution times are obtained with a reduced number of processors
because of the high cost of communications, this number being bigger in
Pentium. The best execution time for each system and matrix size is marked
in table 2.

-  The availability of more potential memory is an additional benefit of parallel
processing, because it allows us to solve bigger problems without swapping.
For example: the parallel algorithm in SUNUltra is quicker than the
sequential algorithm only when the matrix size is big and the sequential execution
produces swapping.

-  The use of more processes than processors produces, in Pentium with big
matrices, better results than one process per processor, because
communications and computations are better overlapped.

The  basic  parallel  algorithm  (figure  2)  is  very  simple  but  it  is  not  optimized 
for  a  network  of  processors.  We  can  try  to  improve  communications  in  at  least 
two  ways: 

-  The broadcast routine is not optimized for networks, and it could be better
if we replace this routine by point to point communications.

-  The diffusion of the norm and the vector can be assembled in only one
communication. In that way more data are transferred, because the last vector
(which need not be sent) is transmitted along with the norm, but fewer
communications are performed.

In table 3 the execution time of the basic version (version 1), the version with
point to point communications (version 2), and the version where the diffusion of
norm and vector are assembled (version 3) are compared. Versions 2 and 3 reduce
the execution time in both systems, and the reduction is clearer in SUNUltra
because of the bigger cost of communication in this system.

Until now, the results shown have been those obtained with PVM, but the
use of MPI produces better results (obviously it depends on the versions we
are using). In table 4 the execution time obtained with the basic version of the
Table 2. Execution time (in seconds) of the Power method using PVM, varying the
number of processors and the matrix size.

Columns: sequential | 1 slave | 2 slaves | 3 slaves | 4 slaves | 5 slaves | 6 slaves | 7 slaves | 8 slaves
[matrix-size labels illegible; entries listed in source order]

SUNUltra
[illegible]: 0.289 0.381 0.547 0.526 0.552 0.638
[illegible]: 0.491 0.838 0.849 0.054
[illegible]: 0.599 1.065 1.491 1.603 4.211 7.582 1.426 1.722 1.940 2.004 23.613 54.464 1.481 1.796 1.901 2.233 2.324 2.421

Pentium
[illegible]: 0.172 0.281 0.250 0.188 0.292 0.323 0.407 0.405 0.436 0.639 0.608 0.682 0.599 0.662 0.656
[illegible]: 1.272 0.908 0.892 0.842 0.891 1.171
[illegible]: 1.750 1.568 1.224 1.275 1.231 1.169
[illegible]: 3.368 1.904 1.911

Table 3. Comparison of the execution time of the Power method using PVM, with
different communication strategies.

Columns: 2 slaves | 3 slaves | 4 slaves | [illegible]
[most cells illegible; legible entries listed in source order]

SUNUltra: 0.262 0.339 1.089 1.045 0.949 0.911
Pentium: 0.585 0.599 0.608 0.603 0.557 1.090 0.985

programme using MPI on SUNUltra is shown. Comparing this table with table 2
we can see that the programme with MPI works better when the number of processors
increases.

Deflation technique. The deflation technique is used to compute the next
eigenvalue and its associated eigenvector starting from a previously known one. This
technique is used to obtain some eigenvalues and it is based on the transformation
of the initial matrix to another one that has the same eigenvalues,
replacing $\lambda_1$ by zero.

To compute numEV eigenvalues of the matrix $A$, the deflation technique can
Table 4. Execution time (in seconds) of the Power method using MPI, varying the
number of processors and the matrix size, on SUNUltra.

Columns: 1 slave | 2 slaves | 3 slaves | 4 slaves | 5 slaves | 6 slaves
[entries listed in source order]

300: 0.15 0.14 0.22 0.28
[illegible]: [illegible] [illegible] 3.58
1500: 22.16 1.37 0.97 0.99 1.06 0.94

A_1 = A
FOR i = 1, 2, ..., numEV
    compute by the Power method lambda_i and q_i^(i)
    update matrix A: A_{i+1} = B_{i+1} A_i
    compute q_i from q_i^(i)
ENDFOR


Fig. 3. Scheme of the deflation technique.


be used performing numEV steps (figure 3), computing in each step, using the
Power method, the biggest eigenvalue ($\lambda_i$) and the corresponding eigenvector
$q_i^{(i)}$ of a matrix $A_i$, with $A_1 = A$. Each matrix $A_{i+1}$ is obtained from matrix
$A_i$ using $q_i^{(i)}$, which is used to form matrix $B_{i+1}$:


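$$
B_{i+1} =
\begin{pmatrix}
1 & 0 & \cdots & 0 & -q_1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 & -q_2 & 0 & \cdots & 0 \\
\vdots & & \ddots & & \vdots & & & \vdots \\
0 & 0 & \cdots & 1 & -q_{k-1} & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & -q_{k+1} & 1 & \cdots & 0 \\
\vdots & & & & \vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 0 & -q_n & 0 & \cdots & 1
\end{pmatrix}
$$

where $q_i^{(i)T} = (q_1, \ldots, q_{k-1}, 1, q_{k+1}, \ldots, q_n)$.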
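The deflation step can be illustrated with a minimal sketch. It assumes that the update in figure 3 is the product $B_{i+1} A_i$, which amounts to subtracting from $A_i$ the eigenvector times the $k$-th row of $A_i$. The function names, the starting vector of ones and the fixed iteration count are hypothetical choices made for the illustration, not part of the original programs.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def power_method(A, iters=500):
    """Power iteration; returns (lam, q, k) with q scaled so that q[k] == 1."""
    n = len(A)
    q = [1.0] * n              # hypothetical starting vector
    lam, k = 0.0, 0
    for _ in range(iters):
        y = matvec(A, q)
        k = max(range(n), key=lambda j: abs(y[j]))
        lam = y[k]
        if lam == 0.0:         # A q vanished: nothing left to iterate on
            break
        q = [v / lam for v in y]
    return lam, q, k


def deflate(A, q, k):
    """Return B A with B = I - q e_k^T (assumed form of the update)."""
    n = len(A)
    row_k = A[k][:]
    return [[A[i][j] - q[i] * row_k[j] for j in range(n)] for i in range(n)]


A1 = [[4.0, 1.0], [1.0, 3.0]]
lam1, q1, k1 = power_method(A1)   # dominant eigenvalue of A1
A2 = deflate(A1, q1, k1)          # same spectrum with lam1 replaced by zero
lam2, _, _ = power_method(A2)     # next eigenvalue of A1
```

Row `k1` of `A2` is zero by construction, so the second call to `power_method` can only converge to one of the remaining eigenvalues.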

The eigenvalue $\lambda_i$ is the biggest eigenvalue in absolute value of matrix $A_i$,
and it is also the $i$-th eigenvalue in absolute value of matrix $A$. The eigenvector
$q_i$ associated to the eigenvalue $\lambda_i$ in matrix $A$ is computed from the eigenvector
$q_i^{(i)}$ associated to $\lambda_i$ in the matrix $A_i$. This computation is performed by repeatedly
applying the formula

[formula illegible in the source]

where $a_k^{(w)}$ is the $k$-th
Table 5. Execution time (in seconds) of the deflation method using PVM, varying the
number of processors and the matrix size, when computing 5% of the eigenvalues.

Columns: sequential | 1 slave | 2 slaves | 3 slaves | 4 slaves | 5 slaves
[entries listed in source order; ambiguous digits left as extracted]

SUNUltra
[illegible]: 11.4 28.5 32.4
600: 184.7 308.3 [illegible] 273.2
900: 91-1.4 12.5S.3 862.5 826.6 765.7 831.0

Pentium
[illegible]: 25.0 29.9 23.0 22.4 22.2
600: 407.0 239.5 195.8 194.6 993.7

column of matrix $A_w$, with $k$ the index where $q_k^{(w)} = 1$.

The most costly part of the algorithm is the application of the Power method,
which has a cost of order $O(n^2)$ per iteration. Therefore, the previously
explained Power method can be applied using a master-slave scheme. The cost of
the deflation part (update matrix) is $2n^2$ flops, and the cost of the computation
of $q_i$ depends on the step and is proportional to $n(i-1)$ flops. These two parts can be
performed simultaneously in the parallel algorithm: the master process computes $q_i$
while the slave processes update matrix $A_i$. The deflation part of the parallel
algorithm (update matrix and compute $q_i$) is not scalable if a large number of
eigenvalues are computed. As we have previously mentioned, scalability is not very
important in these types of systems. In addition, this method is used to compute
only a reduced number of eigenvalues, due to its high cost when computing a
large number of them.

In table 5 the execution times of the sequential and parallel algorithms are
shown on SUNUltra and Pentium, using PVM and the basic version of the Power
method. Compared with table 2, we can see that the behaviour of the parallel Power
and deflation algorithms is similar, but that of the deflation technique is better,
due to the high amount of computation, the work of the master processor in the
deflation part of the execution and the distribution of data between processes in
the parallel algorithm.

Davidson method. The Power method is a very simple but not very useful
method to compute the biggest eigenvalue of a matrix. Other more efficient
methods, for example Davidson methods [18] or Conjugate Gradient methods
[19, 20], can be used.

The Davidson algorithm lets us compute the highest eigenvalue (in absolute
value) of a matrix, though it is especially suitable for large, sparse and symmetric
matrices. The method is valid for real as well as complex matrices. It works by
building a sequence of search subspaces that contain, in the limit, the desired
eigenvector. At the same time as these subspaces are built, approximations to
the desired eigenvector in the current subspace are also built.
V_0 = []
given v_1
k = 1
WHILE convergence not reached
    V_k = [V_{k-1} | v_k]
    orthogonalize V_k using modified Gram-Schmidt
    compute H_k = V_k^T A V_k
    compute the highest eigenpair (theta_k, y_k) of H_k
    obtain the Ritz vector u_k = V_k y_k
    compute the residual r_k = A u_k - theta_k u_k
    IF k = k_max
        reinitialize
    ENDIF
    obtain v_{k+1} = (theta_k I - D)^{-1} r_k
    k = k + 1
ENDWHILE


Fig. 4. Scheme of the sequential Davidson method.


Figure 4 shows a scheme of a sequential Davidson method. In successive
steps a matrix $V_k$ with $k$ orthogonal column vectors is formed. After that, matrix
$H_k = V_k^T A V_k$ is formed and the biggest eigenvalue in absolute value ($\theta_k$)
and its associated eigenvector ($y_k$) are computed. This can be done using the
Power method, because matrix $H_k$ is of size $k \times k$ and $k$ can be kept small
using some reinitialization strategy (when $k = k_{max}$ the process is reinitialized).
The new vector $v_{k+1}$ to be added to the matrix $V_k$ to form $V_{k+1}$ can be
obtained with the succession of operations: $u_k = V_k y_k$, $r_k = A u_k - \theta_k u_k$ and
$v_{k+1} = (\theta_k I - D)^{-1} r_k$, with $D$ the diagonal matrix whose diagonal contains
the diagonal elements of matrix $A$.
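The scheme of figure 4 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' code: it treats a real symmetric matrix stored as a list of rows, uses the transpose in $H_k$, solves the small eigenproblem with a power iteration as the text suggests, and reinitializes with the current Ritz vector. All function names, tolerances and guards against zero denominators are hypothetical.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def largest_eigpair(H, iters=2000):
    """Power method on the small matrix H; returns (theta, unit vector y)."""
    y, theta = [1.0] * len(H), 0.0
    for _ in range(iters):
        z = matvec(H, y)
        nz = math.sqrt(dot(z, z))
        if nz == 0.0:
            break
        y = [v / nz for v in z]
        theta = dot(y, matvec(H, y))      # Rayleigh quotient
    return theta, y


def davidson(A, v1, kmax=None, tol=1e-8, max_iter=100):
    n = len(A)
    kmax = kmax or n
    diag = [A[i][i] for i in range(n)]
    V, AV = [], []                        # basis vectors v_j and products A v_j
    v, theta, u = v1[:], 0.0, v1[:]
    for _ in range(max_iter):
        for b in V:                       # modified Gram-Schmidt
            c = dot(b, v)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        nv = math.sqrt(dot(v, v))
        if nv < 1e-14:                    # no new direction: stop
            break
        v = [vi / nv for vi in v]
        V.append(v)
        AV.append(matvec(A, v))           # only the new column A v_k is computed
        k = len(V)
        H = [[dot(V[i], AV[j]) for j in range(k)] for i in range(k)]
        theta, y = largest_eigpair(H)
        u = [sum(y[j] * V[j][i] for j in range(k)) for i in range(n)]
        Au = matvec(A, u)
        r = [Au[i] - theta * u[i] for i in range(n)]
        if math.sqrt(dot(r, r)) < tol:
            break
        if k == kmax:                     # reinitialize with the Ritz vector
            V, AV, v = [], [], u
            continue
        v = [r[i] / (theta - diag[i] if abs(theta - diag[i]) > 1e-12 else 1e-12)
             for i in range(n)]
    return theta, u


d = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
A = [[d[i] if i == j else 0.1 for j in range(6)] for i in range(6)]
theta, u = davidson(A, [1.0] * 6)
```

Keeping the products `AV` alongside the basis `V` mirrors the remark in the text that only one new matrix-vector product with the basis is needed per step.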

To obtain a parallel algorithm the cost of the different parts in the sequential
algorithm can be analysed. The only operations with cost $O(n^2)$ are two
matrix-vector multiplications: $AV_k$ in the computation of $H_k$ and $Au_k$ in the
computation of the residual. $AV_k$ can be accomplished in order $O(n^2)$ because
it can be decomposed as $[AV_{k-1} | Av_k]$, and $AV_{k-1}$ was computed in the previous
step. The optimum value of $k_{max}$ varies with the matrix size and the type
of the matrix, but it is small and it is not worthwhile to parallelize the other
parts of the sequential algorithm. Therefore, the parallelization of the Davidson
method is done basically in the same way as the Power method: parallelizing
matrix-vector multiplications.

This method has been parallelized using a master-slave scheme (figure 5),
with all the processes working in the two parallelized matrix-vector multiplications,
and the master process performing the non-parallelized operations. In that way,




master:
V_0 = []
given v_1
k = 1
WHILE convergence not reached
    V_k = [V_{k-1} | v_k]
    orthogonalize V_k using modified Gram-Schmidt
    send v_k to slaves
    in parallel compute A v_k and accumulate in the master
    compute H_k = V_k^T (A V_k)
    compute the highest eigenpair (theta_k, y_k) of H_k
    obtain the Ritz vector u_k = V_k y_k
    send u_k to slaves
    in parallel compute A u_k and accumulate in the master
    compute the residual r_k = A u_k - theta_k u_k
    IF k = k_max
        reinitialize
    ENDIF
    obtain v_{k+1} = (theta_k I - D)^{-1} r_k
    k = k + 1
ENDWHILE

slave k, with k = 1, ..., P - 1:
WHILE convergence not reached
    receive v_k from master
    in parallel compute A v_k and accumulate in the master
    receive u_k from master
    in parallel compute A u_k and accumulate in the master
    IF k = k_max
        reinitialize
    ENDIF
    k = k + 1
ENDWHILE


Fig. 5. Scheme of the parallel Davidson method.


matrix  A  is  distributed  between  the  processes  and  the  result  of  the  local  matrix- 
vector  multiplications  is  accumulated  in  the  master  and  distributed  from  the 
master  to  the  other  processes. 
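The distribution of the matrix-vector product can be sketched as follows. The sketch only simulates the data flow in a single Python process, and it assumes a distribution by contiguous blocks of rows, which the text does not specify; the function names are hypothetical, and the real programs would use PVM or MPI calls for the send and accumulate steps.

```python
def row_blocks(n, p):
    """Split row indices 0..n-1 into p contiguous blocks of nearly equal size."""
    base, extra = divmod(n, p)
    blocks, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def local_matvec(A_rows, x):
    """Work done by one process on the rows of A that it owns."""
    return [sum(a * b for a, b in zip(row, x)) for row in A_rows]


def distributed_matvec(A, x, p):
    """Master sends x, each process multiplies its rows, master gathers the pieces."""
    y = []
    for rows in row_blocks(len(A), p):
        y.extend(local_matvec([A[i] for i in rows], x))
    return y


A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [4.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
x = [1.0, 2.0, 3.0]
y = distributed_matvec(A, x, p=3)
```

Each process needs the whole vector but only its own rows of the matrix, which is why one vector is sent and one partial result is returned per multiplication.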

Because  the  operations  parallelized  are  matrix-vector  multiplications,  as  in 
the  Power  method,  the  behaviour  of  the  parallel  algorithms  must  be  similar,  but 
better  results  are  obtained  with  the  Davidson  method  because  in  this  case  the 
master  process  works  in  the  multiplications. 

Table 6 shows the execution time of 50 iterations of the Davidson method
Table 6. Execution time of the Davidson method using MPI, varying the number of
processors and the matrix size, on PC486.

Columns: sequ | p=2 | p=3 | p=4 | p=5 | p=6 | p=7 | p=8
[entries listed in source order; some cells are missing]

300: 0.048 0.051 0.048 0.045 0.043 0.041 0.041 0.042
600: 0.129 0.097 0.083 0.076 0.072 0.065 0.064
900: 0.165 0.131 0.118 0.119 0.112

for  symmetric  complex  matrices  on  PC486  and  using  MPI,  varying  the  number 
of  processors  and  the  matrix  size. 

Givens algorithm or bisection. As we have seen in previous paragraphs, with
parallel Eigenvalue Solvers whose cost is of order $O(n^2)$ only a small reduction
in the execution time can be achieved in networks of processors, and only in some cases:
when the matrices are big or the quotient between the cost of communication
and computation is small.

In some other Eigenvalue Solvers the behaviour is slightly better. For
example, the bisection method is an iterative method to compute eigenvalues in
an interval or the $k$ biggest eigenvalues. It is applicable to symmetric tridiagonal
matrices. This method is especially suitable for parallelization, because it needs
little communication between processes, and communication is the factor that
increases the total time consumed in a network. When computing the eigenvalues in an interval, the
interval is divided into subintervals and each process works in the computation
of the eigenvalues in a subinterval. When computing the $k$ biggest eigenvalues,
each slave knows the number of eigenvalues it must compute. After that,
communications are not necessary, but imbalance is produced by the distribution of
the spectrum. More details on the parallel bisection method are found in [21].
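A minimal sketch of bisection for a symmetric tridiagonal matrix follows. It uses the standard Sturm-sequence count of eigenvalues below a point; the function names, the tolerance and the equal-width splitting of the interval are hypothetical choices for the illustration and are not taken from [21].

```python
def count_below(d, e, x):
    """Number of eigenvalues smaller than x of the symmetric tridiagonal matrix
    with diagonal d and off-diagonal e (signs of the LDL^T pivots of T - x I)."""
    count, q = 0, 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 / q if i > 0 else 0.0
        q = d[i] - x - off
        if q == 0.0:
            q = 1e-300          # nudge an exactly zero pivot
        if q < 0.0:
            count += 1
    return count


def kth_eigenvalue(d, e, k, lo, hi, tol=1e-12):
    """Bisection for the k-th smallest eigenvalue (k = 1, 2, ...) inside [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def split_interval(d, e, lo, hi, p):
    """Cut [lo, hi] into p equal subintervals and count the eigenvalues in each,
    i.e. the amount of work each of p processes would receive."""
    cuts = [lo + (hi - lo) * j / p for j in range(p + 1)]
    below = [count_below(d, e, c) for c in cuts]
    return [(cuts[j], cuts[j + 1], below[j + 1] - below[j]) for j in range(p)]


d, e = [2.0, 2.0, 2.0], [-1.0, -1.0]
eigs = [kth_eigenvalue(d, e, k, 0.0, 4.0) for k in (1, 2, 3)]
work = split_interval(d, e, 0.0, 4.0, 2)
```

In this small example the two halves of the interval hold one and two eigenvalues, a tiny instance of the imbalance caused by the distribution of the spectrum. No communication is needed once each process knows its subinterval.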

The eigenvalues are computed by the processes performing successive
iterations, and each iteration has a cost of order $O(n)$. Despite the low computational
cost good performance is achieved because communications are not necessary
after the subintervals are broadcast. Table 7 shows the efficiency obtained using
this method to calculate all the eigenvalues or only 20% of them, on SUNUltra
and SUNSparc for matrix size 100. The efficiencies are clearly better than in the
previous algorithms, even with small matrices and execution time.

Jacobi method. The Jacobi method for solving the Symmetric Eigenvalue Problem
works by performing successive sweeps, nullifying once on each sweep the
$n(n-1)/2$ nondiagonal elements in the lower-triangular part of the matrix.

It is possible to design a Jacobi method considering the matrix $A$ of size
$n \times n$, dividing it into blocks of size $t \times t$ and doing a sweep on these blocks,
regarding each block as an element. The blocks of size $t \times t$ are grouped into
blocks of size $2kt \times 2kt$ and these blocks are assigned to the processors in such
a way that the load is balanced. Parallel block Jacobi methods are explained in
more detail in [22].
Table 7. Efficiency of the Givens method using PVM, varying the number of
processors, on SUNUltra and SUNSparc, with matrix size 100.

Columns: [illegible] | all the eigenvalues | [illegible]
[entries listed in source order]

SUNUltra: 0.55 0.49 0.52 0.38 0.26 0.19
SUNSparc: 0.94 0.60 0.64 0.81 0.61 0.36 0.37

Table 8. Theoretical speed-up of the Jacobi method, varying the number of processes
and processors.

Columns: processes | p = 2 | p = 3 | p = 4 | p = 5 | p = 6 | p = 7 | p = 8 | p = 9 | p = 10
[entries listed in source order; some cells are missing]

3: 2 2 2 2 2 2 2 2 2
6: 3 3 4.5 4.5 4.5 4.5 4.5 4.5
10: 4 4 5.3 5.3 8 8 8

In order to obtain a distribution of data to the processors, an algorithm for
a logical triangular mesh can be used. The blocks of size $2kt \times 2kt$ must be
assigned to the processors in such a way that the work is balanced. Because the
most costly part of the execution is the updating of the matrix, and nondiagonal
blocks contain twice as many elements to be nullified as diagonal blocks, the load
of nondiagonal blocks can be considered twice the load of diagonal blocks.

Table 8 shows the theoretical speed-up of the method when logical meshes of
3, 6 or 10 processes are assigned to a network, varying the number of processors
in the network from 2 to 10. Higher theoretical speed-up is obtained by increasing
the number of processes. This produces better balancing, but also more
communications and not always a reduction of the execution time. This can be seen in
table 9, where the execution time per sweep for matrices of size 384 and 768 is
shown, varying the number of processors and processes. The results shown have
been obtained in PC486 and SUNUltra using MPI, and the good behaviour of
the parallel algorithm is observed because a relatively large number of processors
can be used while reducing the execution time.
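The load model described above suggests one way to compute a theoretical speed-up: give each diagonal block load 1 and each nondiagonal block load 2, assign the blocks to processors, and divide the total load by the largest load on any processor. The sketch below is an assumption about how such figures could be obtained, not the authors' procedure. The function name is hypothetical and the assignment is a simple largest-load-first greedy heuristic.

```python
def theoretical_speedup(order, processors):
    """Total load divided by the largest per-processor load, for a logical
    triangular mesh with `order` diagonal processes (load 1) and
    order*(order-1)/2 nondiagonal processes (load 2)."""
    loads = [2] * (order * (order - 1) // 2) + [1] * order
    bins = [0] * processors
    for load in loads:                       # largest loads are placed first
        bins[bins.index(min(bins))] += load  # onto the least loaded processor
    return sum(loads) / max(bins)


# Meshes of 3, 6 and 10 processes correspond to order 2, 3 and 4.
speedups = {
    (procs, p): theoretical_speedup(order, p)
    for order, procs in ((2, 3), (3, 6), (4, 10))
    for p in range(2, 11)
}
```

Under this model a mesh of 3 processes never exceeds a speed-up of 2, because its single nondiagonal block already carries half of the total load. Larger meshes spread the load more evenly.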


3  Conclusions 

Our goal is to develop a library of linear algebra routines for LANs. In this paper
some preliminary results on Eigenvalue Solvers are shown. The characteristics of the
environment call for small modifications in the algorithms to adapt them to
the system. In these environments we do not generally have many processors,
and also, when the number of processors goes up, efficiency unavoidably goes
down. For these reasons, when designing algorithms for networks of processors
it is preferable to think of good algorithms for a small number of processors,
Table 9. Execution time per sweep of the Jacobi method on PC486 and SUNUltra
using MPI.

Columns: p=1 | p=2 | p=3 | p=4 | p=5 | p=6 | p=7 | p=8 | p=9 | p=10
[entries listed in source order, including the interleaved row labels]

SUNUltra, 384 (non swapping): 6.25 3 4.78 4.55 [illegible] 7.08 7.31 8.26 —
SUNUltra, 768 (non swapping): [illegible] 50.78 3 28.37 25.50 6 39.51 33.11 32.18 —
PC486, 384 (non swapping): [all entries illegible] —
PC486, 768 (swapping): 698.8 [illegible] 217.4 195.6 6 187.2 142.1 111.6 104.6 10 158.9 145.6 106.5 96.9 83.5 76.5 72.5

and not of scalable algorithms. Because of the great influence of the cost of
communications, a good use of the available environments (MPI or PVM) is
essential.


References 

1. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum,
S. Hammarling, A. McKenney, S. Ostrouchov and D. Sorensen. LAPACK
Users' Guide. SIAM, 1992.

2. ScaLAPACK User's Guide. 1996.

3. L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra,
S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker and R. C.
Whaley. ScaLAPACK: A Linear Algebra Library for Message-Passing Computers.
In Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific
Computing. CD-ROM. SIAM, 1997.

4. J. Demmel and K. Stanley. The Performance of Finding Eigenvalues and Eigenvectors
of Dense Symmetric Matrices on Distributed Memory Computers. In D.
H. Bailey, P. E. Bjørstad, J. R. Gilbert, M. V. Mascagni, R. S. Schreiber, H. D.
Simon, V. J. Torczon and L. T. Watson, editors, Proceedings of the Seventh SIAM
Conference on Parallel Processing for Scientific Computing, pages 528-533. SIAM,
1995.




5. D. Giménez, M. J. Majado and I. Verdú. Solving the Symmetric Eigenvalue Problem
on Distributed Memory Systems. In H. R. Arabnia, editor, Proceedings of the
International Conference on Parallel and Distributed Processing Techniques and
Applications, PDPTA '97, pages 744-747, 1997.

6. Kuo-Chan Huang, Feng-Jian Wang and Pei-Chi Wu. Parallelizing a Level 3 BLAS
Library for LAN-Connected Workstations. Journal of Parallel and Distributed
Computing, 38:28-36, 1996.

7. Gen-Ching Lo and Yousef Saad. Iterative solution of general sparse linear systems
on clusters of workstations. May 1996.

8. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam. Parallel
Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing.
The MIT Press, 1995.

9. Message Passing Interface Forum. A Message-Passing Interface Standard.
International Journal of Supercomputer Applications, (3), 1994.

10. Users guide to mpich. Preprint.

11. F. J. García and D. Giménez. Resolución de sistemas triangulares de ecuaciones
lineales en redes de ordenadores. Facultad de Informática, Universidad de Murcia,
1997.

12. A. Edelman. Large dense linear algebra in 1993: The parallel computing influence.
The International Journal of Supercomputer Applications, 7(2):113-128, 1993.

13. G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins
University Press, 1989. Second edition.

14. L. N. Trefethen and D. Bau III. Numerical Linear Algebra. SIAM, 1997.

15. David S. Watkins. Matrix Computations. John Wiley & Sons, 1991.

16. J. H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, 1965.

17. V. Kumar, A. Grama, A. Gupta and G. Karypis. Introduction to Parallel Computing.
Design and Analysis of Algorithms. The Benjamin Cummings Publishing
Company, 1994.

18. E. R. Davidson. The Iterative Calculation of a Few of the Lowest Eigenvalues
and Corresponding Eigenvectors of Large Real-Symmetric Matrices. Journal of
Computational Physics, 17:87-94, 1975.

19. W. W. Bradbury and R. Fletcher. New Iterative Methods for Solution of the
Eigenproblem. Numerische Mathematik, 9:259-267, 1966.

20. A. Edelman and S. T. Smith. On conjugate gradient-like methods for eigenvalue-like
problems. BIT, 36(3):494-508, 1996.

21. J. M. Badía and A. M. Vidal. Exploiting the Parallel Divide-and-Conquer Method
to Solve the Symmetric Tridiagonal Eigenproblem. In Proceedings of the Sixth
Euromicro Workshop on Parallel and Distributed Processing, Madrid, January 21-23,
1998.

22. D. Giménez, V. Hernández and A. M. Vidal. A Unified Approach to Parallel
Block-Jacobi Methods for the Symmetric Eigenvalue Problem. In Proceedings of
VECPAR'98, 1998.


This article was processed using the LaTeX macro package with LLNCS style




Parallel  Domain-Decomposition  Preconditioning 
for  Computational  Fluid  Dynamics 


Timothy J. Barth¹, Tony F. Chan²*, and Wei-Pai Tang³**

¹ NASA Ames Research Center, Mail Stop T27A-1, Moffett Field, CA 94035, USA
barth@nas.nasa.gov

² UCLA Department of Mathematics, Los Angeles, CA 90095-1555, USA
chan@math.ucla.edu

³ University of Waterloo Department of Computer Science, Waterloo, Ontario N2L
3G1, Canada
wptang@bz.uwaterloo.ca


Abstract. Algebraic preconditioning algorithms suitable for computational
fluid dynamics (CFD) based on overlapping and non-overlapping
domain decomposition (DD) are considered. Specific distinction is given
to techniques well-suited for time-dependent and steady-state computations
of fluid flow. For time-dependent flow calculations, the overlapping
Schwarz algorithm suggested by Wu et al. [28] together with stabilized
(upwind) spatial discretization shows excellent scalability and parallel
performance without requiring a coarse space correction. For steady-state
flow computations, a family of non-overlapping Schur complement
DD techniques is developed. In the Schur complement DD technique,
the triangulation is first partitioned into a number of non-overlapping
subdomains and interfaces. A permutation of the mesh vertices based
on subdomains and interfaces induces a natural $2 \times 2$ block partitioning
of the discretization matrix. Exact LU factorization of this block system
introduces a Schur complement matrix which couples subdomains
and the interface together. A family of simplifying techniques for
constructing the Schur complement and applying the $2 \times 2$ block system as
a DD preconditioner is developed. Sample fluid flow calculations are
presented to demonstrate performance characteristics of the simplified
preconditioners.


1  Overview 

The efficient numerical simulation of compressible fluid flow about complex
geometries continues to be a challenging problem in large scale computing. Many

* The second author was partially supported by the National Science Foundation grant
ASC-9720257, by NASA under contract NAS 2-96027 between NASA and the
Universities Space Research Association (USRA).

** The third author was partially supported by NASA under contract NAS 2-96027
between NASA and the Universities Space Research Association (USRA), by a
Natural Sciences and Engineering Research Council of Canada and by the Information
Technology Research Centre which is funded by the Province of Ontario.
computational problems of interest in combustion, turbulence, aerodynamic
performance analysis and optimization will require orders of magnitude increases
in mesh resolution and/or solution degrees of freedom (dofs) to adequately
resolve relevant fluid flow features. In solving these large problems, issues such as
algorithmic scalability¹ and efficiency become fundamentally important.
Furthermore, current computer hardware projections suggest that the needed
computational resources can only be achieved via parallel computing architectures.
Under this scenario, two algorithmic solution strategies hold particular promise
in computational fluid dynamics (CFD) in terms of complexity and implementation
on parallel computers: (1) multigrid (MG) and (2) domain decomposition
(DD). Both are known to possess essentially optimal solution complexity for
model discretized elliptic equations. Algorithms such as DD are particularly
well-suited to distributed memory parallel computing architectures with high
off-processor memory latency since these algorithms maintain a high degree of
on-processor data locality. Unfortunately, it remains an open challenge to
obtain similar optimal complexity results using DD and/or MG algorithms for
the hyperbolic-elliptic and hyperbolic-parabolic equations modeling compressible
fluid flow. In the remainder of this paper, we report on promising domain
decomposition strategies suitable for the equations of CFD. In doing so, it is
important to distinguish between two types of flow calculations:

1. Steady-state computation of fluid flow. The spatially hyperbolic-elliptic
nature of the equations places special requirements on the solution algorithm.
In the elliptic-dominated limit, global propagation of decaying error
information is needed for optimality. This is usually achieved using either a coarse
space operator (multigrid and overlapping DD methods) or a global interface
operator (non-overlapping DD methods). In the hyperbolic-dominated limit,
error components are propagated along characteristics of the flow. This
suggests specialized coarse space operators (MG and overlapping DD methods)
or special interface operators (non-overlapping DD methods). In later
sections, both overlapping and non-overlapping DD methods are considered in
further detail.

2. Time-dependent computation of fluid flow. The hyperbolic-parabolic nature
of the equations is more forgiving. Observe that the introduction of numerical
time integration implies that error information can only propagate over
relatively small distances during a given time step interval. In the context
of overlapping DD methods with backward Euler time integration, it becomes
mathematically possible to show that scalability is retained without
a coarse space correction by choosing the time step small enough and the
subdomain overlap sufficiently large, cf. Wu et al. [28]. Section 5.3
reviews the relevant theory and examines the practical merits by performing
time-dependent Euler equation computations using overlapping DD with no
coarse space correction.


¹ the arithmetic complexity of algorithms with increasing number of degrees of freedom

2  Scalability  and  Preconditioning

To understand algorithmic scalability and the role of preconditioning, we think
of the partial differential equation (PDE) discretization process as producing
linear or linearized systems of equations of the form

$$Ax - b = 0 \qquad (1)$$

where $A$ is some large (usually sparse) matrix, $b$ is a given right-hand-side
vector, and $x$ is the desired solution. For many practical problems, the amount of
arithmetic computation required to solve (1) by iterative methods can be
estimated in terms of the condition number of the system $\kappa(A)$. If $A$ is symmetric
positive definite (SPD), the well-known conjugate gradient method converges at
a constant rate which depends on $\kappa$. After $n$ iterations of the conjugate gradient
method, the error $e$ satisfies

(2) 
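For reference, the classical conjugate gradient estimate for an SPD matrix can be stated as follows. This is the standard textbook form, given here as an aid to the reader; the energy norm $\|\cdot\|_A$ and the symbols $e^n$, $e^0$ for the error after $n$ and $0$ iterations are notation assumed for this illustration.

```latex
\frac{\| e^{n} \|_{A}}{\| e^{0} \|_{A}}
  \;\le\; 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^{n},
\qquad
\| v \|_{A} = \sqrt{v^{T} A v}.
% Since ((s - 1)/(s + 1))^n <= exp(-2n/s) for s = sqrt(kappa(A)) >= 1,
% reducing the error by a factor epsilon takes at most about
n \;\ge\; \tfrac{1}{2} \sqrt{\kappa(A)} \, \ln\!\left( \frac{2}{\epsilon} \right)
% iterations.
```

The iteration count therefore grows like $\sqrt{\kappa(A)}$, which is why the dependence of the condition number on the mesh spacing matters for scalability.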

For most applications of interest in computational fluid dynamics, the
condition number associated with $A$ depends on computational parameters such as
the mesh spacing $h$, added stabilization terms, and/or artificial viscosity
coefficients. In addition, $\kappa(A)$ can depend on physical parameters such as the Peclet
number and flow direction as well as the underlying stability and well-posedness
of the PDE and boundary conditions. Of particular interest in algorithm
design and implementation is the parallel scalability experiment whereby a mesh
discretization of the PDE is successively refined while keeping a fixed physical
domain so that the mesh spacing $h$ uniformly approaches zero. In this setting,
the matrix $A$ usually becomes increasingly ill-conditioned because of the
dependence of $\kappa(A)$ on $h$. A standard technique to overcome this ill-conditioning is to
solve the prototype linear system in right (or left) preconditioned form

$$(AP^{-1})Px - b = 0. \qquad (3)$$

The solution is unchanged but the convergence rate of iterative methods now
depends on properties of $AP^{-1}$. Ideally, one seeks preconditioning matrices $P$
which are easily solved and in some sense nearby $A$, e.g. $\kappa(AP^{-1}) = O(1)$ when
$A$ is SPD. The situation changes considerably for advection dominated problems.
The matrix $A$ ceases to be SPD so that the performance of iterative methods is
not directly linked to the condition number behavior of $A$. Moreover, the
convergence properties associated with $A$ can depend on nonlocal properties of the
PDE. To see this, consider the advection and advection-diffusion problems shown
in Fig. 1. The entrance/exit flow shown in Fig. 1(a) transports the solution and
any error components along 45° characteristics which eventually exit the domain.
This is contrasted with the recirculation flow shown in Fig. 1(b) which has
circular characteristics in the advection dominated limit. In this (singular) limit, any
radially symmetric error components persist for all time. More generally, these
recirculation error components are removed by the physical cross-wind diffusion
terms present in the PDE or the artificial cross-wind diffusion terms introduced
by the numerical discretization. When the advection speed is large and the
cross-wind diffusion small, the problem becomes ill-conditioned. Brandt and
Yavneh [6] have studied both entrance/exit and recirculation flow within the
context of multigrid acceleration. The behavior of multigrid (or most other
iterative methods) for these two flow problems is notably different.

Fig. 1. Two model advection flows: (a) entrance/exit flow $u_x + u_y = 0$, with
45° error component transport; (b) recirculating flow
$y u_x - x u_y = \lim_{\epsilon \downarrow 0} \epsilon \Delta u$, with a radially
symmetric error component.

For example, Fig. 2 graphs the convergence history of ILU(0)-preconditioned
GMRES in solving Cuthill-McKee ordered matrix problems for entrance/exit flow and
recirculation flow discretized using the Galerkin least-squares (GLS) procedure
described in Sect.
3.  The  entrance/exit  flow  matrix  problem  is  solved  to  a  10“*  accuracy  tolerance 
in  approximately  20  ILU(0)-GMRES  iterations.  The  recirculation  flow  problem 
with  e  =  10~®  requires  45  ILU(0)-GMRES  iterations  to  reach  the  10“®  tolerance 
and approximately 100 ILU(0)-GMRES iterations with $\epsilon = 0$. This
difference in the number of iterations required for each problem increases
dramatically as the mesh is refined. Using the non-overlapping DD method
described in Sect. 5.5, we can remove the ill-conditioning observed in the
recirculating flow problem. Let $V_h$ denote the set of vertices along a nearly
horizontal line from the center of the domain to the right boundary and $V_s$ the
set of remaining vertices in the mesh, see Fig. 3. Next, permute the
discretization matrix so that solution unknowns corresponding to $V_h$ are
ordered last. The remaining mesh vertices have a natural ordering along
characteristics of the advection operator which renders the discretization matrix
associated with $V_s$ nearly lower triangular. Using the technique of Sect. 5.5
together with exact factorization of the small $|V_h| \times |V_h|$ Schur
complement, acceptable convergence rates for ILU(0)-preconditioned GMRES are once
again obtainable as shown in Fig. 3. These promising results have strengthened
our keen interest in DD for fluid flow problems.


Fig.  2.  Convergence  behavior  of  ILU(0)-preconditioned  GMRES  for  entrance/exit  and 
recirculation  flow  problems  using  GLS  discretization  in  a  triangulated  square  (1600 
dofs). 


(a) Horizontal vertex set ordered last in matrix (circled vertices).
(b) ILU-GMRES convergence history (horizontal axis: GMRES matrix-vector
multiplies).

Fig.  3.  Sample  mesh  and  ILU-GMRES  convergence  history  using  the  non-overlapping 
DD  technique  of  Sect.  5.5. 
3  Stabilized  Numerical  Discretization


Non-overlapping domain-decomposition procedures such as those developed in
Sect. 5.5 strongly motivate the use of compact-stencil spatial discretizations
since larger discretization stencils produce larger interface sizes. For this
reason, the Petrov-Galerkin approximation due to Hughes, Franca and Mallet
[17, 18] has been used in the present study. Consider the prototype conservation
law system in $m$ coupled independent variables in the spatial domain
$\Omega \subset \mathbb{R}^d$ with boundary surface $\Gamma$ and exterior normal
$n(x)$

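$$u_{,t} + f^i_{,x_i} = 0, \qquad (x,t) \in \Omega \times [0, \mathbb{R}^+] \qquad (4)$$

$$(n_i f^i_{,u})^- (u - g) = 0, \qquad (x,t) \in \Gamma \times [0, \mathbb{R}^+] \qquad (5)$$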

with implied summation over repeated indices. In this equation,
$u \in \mathbb{R}^m$ denotes the vector of conserved variables and
$f^i \in \mathbb{R}^m$ the inviscid flux vectors. The vector $g$ can be suitably
chosen to impose characteristic data or surface flow tangency using reflection
principles. The conservation law system (4) is assumed to possess a generalized
entropy pair so that the change of variables
$u(v) : \mathbb{R}^m \mapsto \mathbb{R}^m$ symmetrizes the system in quasi-linear
form

$$u_{,v}\, v_{,t} + f^i_{,v}\, v_{,x_i} = 0 \qquad (6)$$

with $u_{,v}$ symmetric positive definite and $f^i_{,v}$ symmetric. The
computational domain $\Omega$ is composed of non-overlapping simplicial elements
$T_i$, $\Omega = \cup T_i$, $T_i \cap T_j = \emptyset$, $i \ne j$. For purposes of
the present study, our attention is restricted to steady-state calculations. Time
derivatives are retained in the Galerkin integral so that a pseudo-time marching
strategy can be used for obtaining steady-state solutions. The Galerkin
least-squares method due to Hughes, Franca and Mallet [17] can be defined via the
following variational problem with time derivatives
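omitted from the least-squares bilinear form: Let $\mathcal{V}^h$ denote the
finite element space

$$\mathcal{V}^h = \left\{ w^h \,\middle|\, w^h \in \left(C^0(\Omega)\right)^m,\; w^h|_T \in \left(\mathcal{P}_k(T)\right)^m \right\}.$$

Find $v^h \in \mathcal{V}^h$ such that for all $w^h \in \mathcal{V}^h$

$$B(v^h, w^h)_{gal} + B(v^h, w^h)_{ls} + B(v^h, w^h)_{bc} = 0 \qquad (7)$$

with

$$B(v,w)_{gal} = \int_\Omega \left( w^T u(v)_{,t} - w^T_{,x_i} f^i(v) \right) d\Omega$$

$$B(v,w)_{ls} = \sum_{T \in \Omega} \int_T \left( f^i_{,v} w_{,x_i} \right)^T \tau \left( f^j_{,v} v_{,x_j} \right) d\Omega$$

$$B(v,w)_{bc} = \int_\Gamma w^T h(v, g; n)\, d\Gamma$$

where

$$h(v_-, v_+, n) = \tfrac12 \left( f(u(v_-); n) + f(u(v_+); n) \right) - \tfrac12 |A(u(v); n)| \left( u(v_+) - u(v_-) \right).$$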
Inserting standard polynomial spatial approximations and mass-lumping the
remaining time derivative terms yields coupled ordinary differential equations of
the form

$$D\,u_t = \mathcal{R}(u), \qquad \mathcal{R}(u) : \mathbb{R}^n \mapsto \mathbb{R}^n \qquad (8)$$

or in symmetric variables

$$D\,u_{,v}\,v_t = \mathcal{R}(u(v)) \qquad (9)$$

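where $D$ represents the (diagonal) lumped mass matrix. In the present study,
backward Euler time integration with local time linearization is applied to Eqn.
(8) yielding:

$$\left[ \frac{1}{\Delta t} D - \left( \frac{\partial \mathcal{R}}{\partial u} \right)^{n} \right] \left( u^{n+1} - u^n \right) = \mathcal{R}(u^n) \qquad (10)$$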


The above equation can also be viewed as a modified Newton method for solving
the steady-state equation $\mathcal{R}(u) = 0$. For each modified Newton step, a
linear system with a large Jacobian matrix must be solved. In practice $\Delta t$
is varied as an exponential function of $\|\mathcal{R}(u)\|$ so that Newton's
method is approached as $\|\mathcal{R}(u)\| \to 0$. Since each Newton iterate in
(10) produces a linear system of the form (1), our attention focuses on this
prototype linear form.
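The pseudo-time marching idea can be illustrated with a minimal sketch on a small dense model problem. Each step solves $[D/\Delta t - \partial\mathcal{R}/\partial u]\,\delta u = \mathcal{R}(u)$ and enlarges $\Delta t$ as the residual norm drops, so the iteration approaches Newton's method near convergence. The growth law `dt = dt0 * r0 / ||R||` is an assumed stand-in, since the text only states that $\Delta t$ depends on the residual norm. The function name and the cubic test residual are also illustrative assumptions.

```python
import numpy as np

def pseudo_time_newton(residual, jacobian, D, u0, dt0=1.0, tol=1.0e-10, max_steps=100):
    """Backward Euler pseudo-time marching toward residual(u) = 0.

    Returns (u, steps); steps == max_steps signals that tol was not reached.
    """
    u = np.array(u0, dtype=float)
    r0 = np.linalg.norm(residual(u))
    for step in range(max_steps):
        r = residual(u)
        rnorm = np.linalg.norm(r)
        if rnorm <= tol:
            return u, step
        dt = dt0 * r0 / rnorm            # assumed residual-based growth law
        lhs = D / dt - jacobian(u)       # modified Newton matrix
        u = u + np.linalg.solve(lhs, r)  # dense solve stands in for the Krylov solver
    return u, max_steps

# Model problem (an assumption for illustration): R(u) = b - K u - u**3.
K = 2.0 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
b = np.ones(5)
residual = lambda u: b - K @ u - u**3
jacobian = lambda u: -K - 3.0 * np.diag(u**2)
u, steps = pseudo_time_newton(residual, jacobian, np.eye(5), np.zeros(5))
print(steps, np.linalg.norm(residual(u)))
```

In a production solver the dense `np.linalg.solve` call would be replaced by a preconditioned Krylov iteration on the sparse system, which is the prototype problem (1).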


4  Domain  Partitioning 

In the present study, meshes are partitioned using the multilevel $k$-way
partitioning algorithm METIS developed by Karypis and Kumar [19]. Figure 4(a)
shows a typical airfoil geometry and triangulated domain. To construct a
non-overlapping partitioning, a dual triangulation graph has been provided to the
METIS partitioning software. Figure 4(b) shows partition boundaries and sample
solution contours using the spatial discretization technique described in the
previous section. By partitioning the dual graph of the triangulation, the number
of elements in each subdomain is automatically balanced by the METIS software.
Unfortunately, a large percentage of computation in our domain-decomposition
algorithm is proportional to the interface size associated with each subdomain.
On general meshes containing non-uniform element densities, balancing subdomain
sizes does not imply a balance of interface sizes. In fact, results in Sect. 6
show increased imbalance of interface sizes as meshes are partitioned into larger
numbers of subdomains. This ultimately leads to poor load balancing of the
parallel computation, a topic revisited in Sect. 6.

Fig. 4. Multiple component airfoil geometry with 16 subdomain partitioning and
sample solution contours ($M_\infty = 0.20$, $\alpha = 10°$): (a) mesh
triangulation (80,000 elements); (b) Mach number solution contours and partition
(bold lines).
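The distinction between balanced element counts and balanced interface sizes can be made concrete with a small sketch. It builds the dual graph of a triangulation, in which two triangles are adjacent when they share an edge, and counts the cut dual edges per subdomain for a given element partition. The partitioner itself is not reproduced; the partition vector is supplied by hand, and the function names are illustrative assumptions.

```python
from collections import defaultdict

def dual_graph(triangles):
    """Map each triangle index to the set of triangles sharing an edge with it."""
    edge_to_tris = defaultdict(list)
    for t, (a, b, c) in enumerate(triangles):
        for edge in ((a, b), (b, c), (c, a)):
            edge_to_tris[tuple(sorted(edge))].append(t)
    adj = {t: set() for t in range(len(triangles))}
    for tris in edge_to_tris.values():
        if len(tris) == 2:  # interior edge shared by exactly two triangles
            s, t = tris
            adj[s].add(t)
            adj[t].add(s)
    return adj

def interface_sizes(adj, part):
    """Count, for each subdomain, the dual-graph edges cut by the partition."""
    cut = defaultdict(int)
    for s, nbrs in adj.items():
        for t in nbrs:
            if part[s] != part[t]:
                cut[part[s]] += 1
    return dict(cut)

# Two squares split into four triangles (vertices numbered 0..5 on a 3 x 2 grid).
triangles = [(0, 1, 4), (0, 4, 3), (1, 2, 5), (1, 5, 4)]
adj = dual_graph(triangles)
print(adj)
print(interface_sizes(adj, [0, 1, 2, 2]))  # subdomain 0 touches two interfaces
```

Even in this tiny example the subdomain holding a single element carries the largest interface, which mirrors the load-balancing concern raised above.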

5  Preconditioning  Algorithms  for  CFD 

In  this  section,  we  consider  several  candidate  preconditioning  techniques  based 
on  overlapping  and  non-overlapping  domain  decomposition. 


5.1  ILU  Factorization 

A common preconditioning choice is incomplete lower-upper factorization with
arbitrary fill level $k$, ILU[$k$]. Early application and analysis of ILU
preconditioning is given in Evans [15], Stone [27] and Meijerink and van der
Vorst [21]. Although the technique is algebraic and well-suited to sparse
matrices, ILU-preconditioned systems are not generally scalable. For example,
Dupont et al. [14] have shown that ILU[0] preconditioning does not asymptotically
change the $O(h^{-2})$ condition number of the 5-point difference approximation
to Laplace's equation. Figure 5 shows the convergence of ILU-preconditioned GMRES
for Cuthill-McKee ordered matrix problems obtained from diffusion and advection
dominated problems discretized using Galerkin and Galerkin least-squares
techniques respectively with linear elements. Both problems show pronounced
convergence deterioration as the number of solution unknowns (degrees of freedom)
increases. Note that matrix orderings exist for discretized scalar advection
equations that are vastly superior to Cuthill-McKee ordering. Unfortunately,
these orderings do not generalize naturally to coupled systems of equations which
do not have a single characteristic direction. Some ILU matrix ordering
experiments are given in [10]. Keep in mind that ILU does recluster eigenvalues
of the preconditioned matrix so that for small enough problems a noticeable
improvement can often be observed when ILU preconditioning is combined with a
Krylov projection sequence.
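The lack of scalability of ILU[0] on the model Laplace problem can be checked on small grids. The sketch below uses dense NumPy arrays for brevity, which is an assumption made for illustration and not how a sparse production code would store the factors. It forms the zero-fill incomplete factors of the 5-point Laplacian and compares the spectral condition number before and after preconditioning. The function names are illustrative.

```python
import numpy as np

def laplacian_2d(n):
    """5-point Laplacian on an n-by-n interior grid (unscaled, Dirichlet)."""
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    I = np.eye(n)
    return np.kron(I, T) + np.kron(T, I)

def ilu0(A):
    """Incomplete LU with zero fill: updates are restricted to the pattern of A."""
    n = A.shape[0]
    LU = A.copy()
    pattern = A != 0.0
    for i in range(1, n):
        for k in np.nonzero(pattern[i, :i])[0]:
            LU[i, k] /= LU[k, k]
            cols = np.nonzero(pattern[i, k + 1:])[0] + k + 1
            LU[i, cols] -= LU[i, k] * LU[k, cols]
    L = np.tril(LU, -1) + np.eye(n)
    U = np.triu(LU)
    return L, U

def cond_spectrum(B):
    """Ratio of extreme eigenvalues (real, positive spectrum assumed)."""
    ev = np.linalg.eigvals(B).real
    return ev.max() / ev.min()

for n in (8, 16):
    A = laplacian_2d(n)
    L, U = ilu0(A)
    print(n, np.linalg.cond(A), cond_spectrum(np.linalg.solve(L @ U, A)))
```

The preconditioned condition number is smaller than the original one on each grid, yet it still grows as the grid is refined, which is the behavior described above.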


5.2  Additive  Overlapping  Schwarz  Methods 

Let $V$ denote the triangulation vertex set. Assume the triangulation has been
partitioned into $N$ overlapping subdomains with vertex sets $V_i$,
$i = 1, \ldots, N$ such that

$$V = \cup_{i=1}^{N} V_i .$$

Fig. 5. Convergence dependence of ILU on the number of mesh points for diffusion
and advection dominated problems using SUPG discretization and Cuthill-McKee
ordering: (a) diffusion dominated problem; (b) advection dominated problem.

Let $R_i$ denote the rectangular restriction matrix that returns the vector of
coefficients in the subdomain $\Omega_i$, i.e.

$$x_{\Omega_i} = R_i x .$$

Note that $A_i = R_i A R_i^T$ is the subdomain discretization matrix in
$\Omega_i$. The additive Schwarz preconditioner $P^{-1}$ from (3) is then written
as

$$P^{-1} = \sum_{i=1}^{N} R_i^T A_i^{-1} R_i .$$

The additive Schwarz algorithm [24] is appealing since each subdomain solve can
be performed in parallel. Unfortunately the performance of the algorithm
deteriorates as the number of subdomains increases. Let $H$ denote the
characteristic size of each subdomain, $\delta$ the overlap distance, and $h$ the
mesh spacing. Dryja and Widlund [12, 13] give the following condition number
bound for the method when used as a preconditioner for elliptic discretizations

$$\kappa(AP^{-1}) \le C H^{-2} \left( 1 + (H/\delta)^2 \right) \qquad (11)$$

where $C$ is a constant independent of $H$ and $h$. This result describes the
deterioration as the number of subdomains increases (and $H$ decreases). With
some additional work this deterioration can be removed by the introduction of a
global coarse subspace with restriction matrix $R_0$ with scale $H$ so that

$$P^{-1} = R_0^T A_0^{-1} R_0 + \sum_{i=1}^{N} R_i^T A_i^{-1} R_i .$$
Under the assumption of "generous overlap" the condition number bound
[12, 13, 8] can be improved to

$$\kappa(AP^{-1}) \le C \left( 1 + (H/\delta) \right) . \qquad (12)$$

The addition of a coarse space approximation introduces implementation problems
similar to those found in multigrid methods described below. Once again, the
theory associated with additive Schwarz methods for hyperbolic PDE systems is not
well-developed. Practical applications of the additive Schwarz method for the
steady-state calculation of hyperbolic PDE systems show similar deterioration of
the method when the coarse space is omitted. Figure 6 shows the performance of
the additive Schwarz algorithm used as a preconditioner for GMRES. The test
matrix was taken from one step of Newton's method applied to an upwind finite
volume discretization of the Euler equations at low Mach number
($M_\infty = 0.2$), see Barth [1] for further details. These calculations were
performed without coarse mesh correction. As expected, the graphs show a
degradation in quality with decreasing overlap and increasing number of mesh
partitions.

Fig. 6. Performance of GMRES with additive overlapping Schwarz preconditioning:
(a) effect of increasing mesh overlap; (b) effect of increasing number of
subdomains.
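The trends described for the one-level additive Schwarz method can be reproduced on a small elliptic model problem. The following sketch is an illustration under simplifying assumptions: a dense 1-D Laplacian, contiguous index blocks as subdomains, and exact subdomain inverses. It assembles $P^{-1} = \sum_i R_i^T A_i^{-1} R_i$ explicitly and evaluates the condition number of the preconditioned operator. All function names are hypothetical.

```python
import numpy as np

def laplacian_1d(n):
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def overlapping_index_sets(n, nsub, overlap):
    """Contiguous blocks of range(n), each extended by `overlap` points per side."""
    bounds = np.linspace(0, n, nsub + 1).astype(int)
    return [np.arange(max(lo - overlap, 0), min(hi + overlap, n))
            for lo, hi in zip(bounds[:-1], bounds[1:])]

def additive_schwarz(A, index_sets):
    """Dense P^{-1} = sum_i R_i^T A_i^{-1} R_i with A_i = R_i A R_i^T."""
    n = A.shape[0]
    Pinv = np.zeros((n, n))
    for idx in index_sets:
        Ai = A[np.ix_(idx, idx)]
        Pinv[np.ix_(idx, idx)] += np.linalg.inv(Ai)
    return Pinv

def preconditioned_cond(A, Pinv):
    """Condition number of A P^{-1} via the similar symmetric matrix C^T P^{-1} C."""
    C = np.linalg.cholesky(A)
    ev = np.linalg.eigvalsh(C.T @ Pinv @ C)
    return ev[-1] / ev[0]

A = laplacian_1d(64)
for nsub in (2, 4, 8):
    for overlap in (2, 4):
        Pinv = additive_schwarz(A, overlapping_index_sets(64, nsub, overlap))
        print(nsub, overlap, preconditioned_cond(A, Pinv))
```

No coarse space is included, so the printed condition numbers increase with the number of subdomains and decrease with larger overlap.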


5.3  Additive  Overlapping  Schwarz  Methods  for  Time-Dependent 
Fluid  Flow 

We  begin  by  giving  a  brief  sketch  of  the  analysis  given  in  Wu  et  al.  [28]  which 
shows  that  for  small  enough  time  step  and  large  enough  overlap,  the  additive 
Schwarz  preconditioner  for  hyperbolic  problems  behaves  optimally  without  re¬ 
quiring  a  coarse  space  correction. 
Consider the model scalar hyperbolic equation for the spatial domain
$\Omega \subset \mathbb{R}^d$ with characteristic boundary data $g$ weakly
imposed on $\Gamma$

$$\frac{\partial u}{\partial t} + \beta \cdot \nabla u + c\,u = 0, \qquad (x,t) \in \Omega \times [0,T] \qquad (13)$$

$$(\beta \cdot n(x))^- (u - g) = 0, \qquad x \in \Gamma$$

with $\beta \in \mathbb{R}^d$, $c > 0$, and suitable initial data. Suppose that
backward Euler time integration is employed ($u^n(x) = u(x, n\,\Delta t)$), so
that (13) can then be written

$$\beta \cdot \nabla u^n + (c + (\Delta t)^{-1})\,u^n = f$$

with $f = (\Delta t)^{-1} u^{n-1}$. Next solve this equation using Galerkin's
method (dropping the superscript $n$): Find $u \in H^1(\Omega)$

$$(\beta \cdot \nabla u, v) + (c + (\Delta t)^{-1})\,(u,v) = (f,v) + \langle u - g, v \rangle_- \qquad \forall v \in H^1(\Omega)$$

where $(u,v) = \int_\Omega uv\,dx$ and
$\langle u,v \rangle_\pm = \int_\Gamma uv\,(\beta \cdot n(x))^\pm\,dx$. Recall
that Galerkin's method for linear advection is iso-energetic modulo boundary
conditions so that the symmetric part of the bilinear form is simply

$$A(u,v) = (c + (\Delta t)^{-1})\,(u,v) + \tfrac12 \left( \langle u,v \rangle_+ - \langle u,v \rangle_- \right)$$

with skew-symmetric portion
$S(u,v) = \tfrac12 \langle u,v \rangle - (u, \beta \cdot \nabla v)$. Written in
this form, it becomes clear that the term $(c + (\Delta t)^{-1})\,(u,v)$
eventually dominates the skew-symmetric bilinear term if $\Delta t$ is chosen
small enough. This leads to a CFL-like upper bound on $|\beta|\,\Delta t$ (stated
in [28] with a parameter $s > 0$). With this assumption, scalability of the
overlapping Schwarz method without coarse space correction can be shown.
Unfortunately, the assumed CFL-like restriction makes the method impractical
since more efficient explicit time advancement strategies could be used which
obviate the need for mesh overlap or implicit subdomain solves. The situation
changes considerably if a Petrov-Galerkin discretization strategy is used such as
described in Sect. 3. For the scalar model equation (13) this amounts to adding
the symmetric bilinear stabilization term
$B_{ls}(u,v) = (\beta \cdot \nabla u, \tau \beta \cdot \nabla v)$ to the previous
Galerkin formulation: Find $u \in H^1(\Omega)$

$$(\beta \cdot \nabla u, v) + (\beta \cdot \nabla u, \tau \beta \cdot \nabla v) + (c + (\Delta t)^{-1})\,(u,v) = (f,v) + \langle u - g, v \rangle_- \qquad \forall v \in H^1(\Omega)$$

where $\tau = h/(2|\beta|)$. This strategy is considered in a sequel paper to
[28] which has yet to appear in the open literature. Practical CFD calculations
show surprisingly good performance of overlapping Schwarz preconditioning when
combined with Galerkin least-squares discretization of hyperbolic systems as
discussed in Sect. 3. Figure 7 shows iso-density contours for Mach 3 flow over a
backward-facing step geometry using a triangulated mesh containing 22000 mesh
vertices which has been partitioned into 1, 4, 16, and 32 subdomains for
evaluation purposes. Owing to the nonlinearity of the strong shockwave profiles,
the solution must be evolved in time at a relatively small Courant number (< 20)
to prevent nonlinear instability in the numerical method. On the other hand, the
solution eventually reaches an equilibrium state. (Note that on finer resolution
meshes, the fluid contact surface emanating from the Mach triple point eventually
makes the flow field unsteady.) This test problem provides an ideal candidate
scenario for the overlapping Schwarz method since time accuracy is not essential
to reaching the correct steady-state solution.

Fig. 7. Isodensity contours for Mach 3 inviscid Euler flow over a backward-facing
step with exploded view of 16 subdomain partitioning.

Fig. 8. Number of ILU(0)-GMRES iterations required to reach the residual
tolerance on $\|Ax - b\|$; panel (b) uses a fixed CFL = 15.

Computations were performed on a fixed size mesh with 1, 4, 16, and 32 subdomains
while also varying the overlap (graph) distance and the CFL number. Figure 8(a)
shows the effect of increasing CFL number and subdomain mesh overlap distance on
the number of global
GMRES  iterations  required  to  solve  the  global  matrix  problem  to  an  accuracy 
of  less  than  10“®  using  additive  Schwarz-like  ILU(O)  on  overlapped  subdomain 
meshes.  For  CFL  numbers  less  than  about  10,  the  number  of  GMRES  iterations 
is relatively insensitive to the amount of overlap. Figure 8(b) shows the effect
of  increased  mesh  partitioning  on  the  number  of  GMRES  iterations  required 
(assuming  a  fixed  CFL  =  15).  For  overlap  distance  >  2,  the  iterative  method  is 
relatively  insensitive  to  the  number  of  subdomains.  By  lowering  the  CFL  number 
to  10,  the  results  become  even  less  sensitive  to  the  number  of  subdomains. 
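The role of the time step in the analysis of this section, namely that the symmetric term $(c + (\Delta t)^{-1})(u,v)$ dominates the skew-symmetric advection term once $\Delta t$ is small, can be seen in a simple discrete analogue. The sketch below substitutes a periodic central-difference operator for the Galerkin finite element discretization. This substitution is an assumption made for brevity; it shares the property that the advection operator is exactly skew-symmetric. The function name is illustrative.

```python
import numpy as np

def skew_to_symmetric_ratio(n, beta, c, dt):
    """2-norm ratio of the skew part to the symmetric part of the implicit operator."""
    h = 1.0 / n
    shift = np.roll(np.eye(n), 1, axis=1)   # (shift @ u)[i] = u[i+1], periodic
    D = (shift - shift.T) / (2.0 * h)       # central difference, skew-symmetric
    A = beta * D + (c + 1.0 / dt) * np.eye(n)
    sym = 0.5 * (A + A.T)
    skew = 0.5 * (A - A.T)
    return np.linalg.norm(skew, 2) / np.linalg.norm(sym, 2)

n, beta = 64, 1.0
h = 1.0 / n
for cfl in (0.5, 2.0, 10.0, 50.0):
    print(cfl, skew_to_symmetric_ratio(n, beta, 0.0, cfl * h / beta))
```

With $c = 0$ the ratio equals the CFL number $|\beta|\Delta t/h$ in this model, so a CFL-like restriction is precisely the condition under which the symmetric part controls the operator.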


5.4  Multi-level  Methods 

In the past decade, multi-level approaches such as multigrid have proven to be
among the most effective techniques for solving discretizations of elliptic PDEs
[29]. For certain classes of elliptic problems, multigrid attains optimal
scalability. For hyperbolic-elliptic problems such as the steady-state
Navier-Stokes equations, the success of multigrid is less convincing. For
example, Ref. [20] presents numerical results using multigrid to solve
compressible Navier-Stokes flow about a multiple-component wing geometry with
asymptotic convergence rates approaching 0.98 (Fig. 12 in Ref. [20]). This is
quite far from the usual convergence rates quoted for multigrid on elliptic model
problems. This is not too surprising since multigrid for hyperbolic-elliptic
problems is not well-understood. In addition, some multigrid algorithms require
operations such as mesh coarsening which are poorly defined for general meshes
(especially in 3-D) or place unattainable shape-regularity demands on mesh
generation. Other techniques add new meshing constraints to existing software
packages which limit the overall applicability of the software. Despite the
promising potential of multigrid for non-selfadjoint problems, we defer further
consideration and refer the reader to works such as [6, 5].


5.5  Schur  Complement  Algorithms 

Schur complement preconditioning algorithms are a general family of algebraic
techniques in non-overlapping domain-decomposition. These techniques can be
interpreted as variants of the well-known substructuring method introduced by
Przemieniecki [22] in structural analysis. When recursively applied, the method
is related to the nested dissection algorithm. In the present development, we
consider an arbitrary domain as illustrated in Fig. 9 that has been further
decomposed into subdomains labeled 1-4, interfaces labeled 5-9, and cross points
x. A natural 2 × 2 partitioning of the system is induced by permuting rows and
columns of the discretization matrix so that subdomain unknowns are ordered
first, interface unknowns second, and cross points ordered last

$$Ax = \begin{bmatrix} A_{\mathcal{D}\mathcal{D}} & A_{\mathcal{D}\mathcal{I}} \\ A_{\mathcal{I}\mathcal{D}} & A_{\mathcal{I}\mathcal{I}} \end{bmatrix} \begin{pmatrix} x_{\mathcal{D}} \\ x_{\mathcal{I}} \end{pmatrix} = \begin{pmatrix} b_{\mathcal{D}} \\ b_{\mathcal{I}} \end{pmatrix} \qquad (14)$$

where $x_{\mathcal{D}}$, $x_{\mathcal{I}}$ denote the subdomain and interface
variables respectively.

Fig. 9. Domain decomposition and the corresponding block matrix: (a) partitioned
domain; (b) induced 2 × 2 block discretization matrix.

The block LU factorization of $A$ is then given by

$$A = LU = \begin{bmatrix} A_{\mathcal{D}\mathcal{D}} & 0 \\ A_{\mathcal{I}\mathcal{D}} & I \end{bmatrix} \begin{bmatrix} I & A_{\mathcal{D}\mathcal{D}}^{-1} A_{\mathcal{D}\mathcal{I}} \\ 0 & S \end{bmatrix} \qquad (15)$$

where

$$S = A_{\mathcal{I}\mathcal{I}} - A_{\mathcal{I}\mathcal{D}} A_{\mathcal{D}\mathcal{D}}^{-1} A_{\mathcal{D}\mathcal{I}} \qquad (16)$$

is the Schur complement for the system. Note that $A_{\mathcal{D}\mathcal{D}}$ is
block diagonal with each block associated with a subdomain matrix. Subdomains are
decoupled from each other and only coupled to the interface. The subdomain
decoupling property is exploited heavily in parallel implementations.
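The block factorization with the Schur complement can be verified numerically in a few lines. The sketch below uses a random, diagonally shifted dense matrix as a stand-in for a discretization matrix, which is an assumption made for illustration. The variable names (`ADD`, `ADI`, `AID`, `AII`) simply mirror the subdomain and interface blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
nD, nI = 6, 3  # number of subdomain and interface unknowns (illustrative sizes)
A = rng.standard_normal((nD + nI, nD + nI)) + (nD + nI) * np.eye(nD + nI)

ADD, ADI = A[:nD, :nD], A[:nD, nD:]
AID, AII = A[nD:, :nD], A[nD:, nD:]

# Schur complement S = A_II - A_ID A_DD^{-1} A_DI
S = AII - AID @ np.linalg.solve(ADD, ADI)

# Block factors L = [[A_DD, 0], [A_ID, I]] and U = [[I, A_DD^{-1} A_DI], [0, S]]
L = np.block([[ADD, np.zeros((nD, nI))], [AID, np.eye(nI)]])
U = np.block([[np.eye(nD), np.linalg.solve(ADD, ADI)], [np.zeros((nI, nD)), S]])

print(np.allclose(L @ U, A))
```

Multiplying the two block factors reproduces $A$ because the lower-right block of the product is $A_{\mathcal{I}\mathcal{D}} A_{\mathcal{D}\mathcal{D}}^{-1} A_{\mathcal{D}\mathcal{I}} + S$, which equals $A_{\mathcal{I}\mathcal{I}}$ by definition of $S$.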

In  the  next  section,  we  outline  a  naive  parallel  implementation  of  the  “exact” 
factorization.  This  will  serve  as  the  basis  for  a  number  of  simplifying  approxi¬ 
mations  that  will  be  discussed  in  later  sections. 


5.6  “Exact"  Factorization 

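Given the domain partitioning illustrated in Fig. 9, a straightforward (but
naive) parallel implementation would assign a processor to each subdomain and a
single processor to the Schur complement. Let $\mathcal{I}_i$ denote the union of
interfaces surrounding $\mathcal{D}_i$. The entire solution process would then
consist of the following steps:

Parallel Preprocessing:

1. Parallel computation of subdomain $A_{\mathcal{D}_i\mathcal{D}_i}$ matrix LU
factors.

2. Parallel computation of Schur complement block entries associated with each
subdomain $\mathcal{D}_i$

$$\Delta S_{\mathcal{I}_i} = A_{\mathcal{I}_i\mathcal{D}_i} A_{\mathcal{D}_i\mathcal{D}_i}^{-1} A_{\mathcal{D}_i\mathcal{I}_i} . \qquad (17)$$

3. Accumulation of the global Schur complement $S$ matrix

$$S = A_{\mathcal{I}\mathcal{I}} - \sum_{i=1}^{\#subdomains} \Delta S_{\mathcal{I}_i} . \qquad (18)$$

Solution:

Step (1)  $u_{\mathcal{D}_i} = A_{\mathcal{D}_i\mathcal{D}_i}^{-1} b_{\mathcal{D}_i}$  (parallel)

Step (2)  $v_{\mathcal{I}_i} = A_{\mathcal{I}_i\mathcal{D}_i} u_{\mathcal{D}_i}$  (parallel)

Step (3)  $w_{\mathcal{I}} = b_{\mathcal{I}} - \sum_{i=1}^{\#subdomains} v_{\mathcal{I}_i}$  (communication)

Step (4)  $x_{\mathcal{I}} = S^{-1} w_{\mathcal{I}}$  (sequential, communication)

Step (5)  $y_{\mathcal{D}_i} = A_{\mathcal{D}_i\mathcal{I}_i} x_{\mathcal{I}_i}$  (parallel)

Step (6)  $x_{\mathcal{D}_i} = u_{\mathcal{D}_i} - A_{\mathcal{D}_i\mathcal{D}_i}^{-1} y_{\mathcal{D}_i}$  (parallel)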


This  algorithm  has  several  deficiencies.  Steps  3  and  4  of  the  solution  process 
are  sequential  and  require  communication  between  the  Schur  complement  and 
subdomains.  More  generally,  the  algorithm  is  not  scalable  since  the  growth  in 
size  of  the  Schur  complement  with  increasing  number  of  subdomains  eventually 
overwhelms  the  calculation  in  terms  of  memory,  computation,  and  communica¬ 
tion. 
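The "exact" factorization procedure can be sketched serially as follows. This is an illustration under simplifying assumptions, not the parallel implementation: each subdomain is coupled to the full interface index set instead of its local interface, dense solves replace stored LU factors, and the loops over subdomains stand in for work that would run concurrently. The function name `schur_solve` is hypothetical.

```python
import numpy as np

def schur_solve(A, b, subdomains, interface):
    """Solve A x = b by Schur complement elimination of decoupled subdomains."""
    I = np.asarray(interface)
    # Preprocessing: accumulate S = A_II - sum_i A_{I D_i} A_{D_i D_i}^{-1} A_{D_i I}
    S = A[np.ix_(I, I)].copy()
    for Di in subdomains:
        ADD = A[np.ix_(Di, Di)]
        S -= A[np.ix_(I, Di)] @ np.linalg.solve(ADD, A[np.ix_(Di, I)])
    # Steps 1-3: subdomain solves and interface right-hand side
    u = {}
    w = b[I].copy()
    for k, Di in enumerate(subdomains):
        u[k] = np.linalg.solve(A[np.ix_(Di, Di)], b[Di])
        w -= A[np.ix_(I, Di)] @ u[k]
    # Step 4: interface solve with the Schur complement
    xI = np.linalg.solve(S, w)
    # Steps 5-6: back-substitution into each subdomain
    x = np.zeros_like(b)
    x[I] = xI
    for k, Di in enumerate(subdomains):
        y = A[np.ix_(Di, I)] @ xI
        x[Di] = u[k] - np.linalg.solve(A[np.ix_(Di, Di)], y)
    return x

# 1-D Laplacian with two interface points separating three subdomains.
n = 11
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.arange(1.0, n + 1.0)
x = schur_solve(A, b, [[0, 1, 2], [4, 5, 6], [8, 9, 10]], [3, 7])
print(np.allclose(x, np.linalg.solve(A, b)))
```

The sketch relies on the subdomains being mutually decoupled. Only the accumulation of `w` and the solve with `S` touch data from every subdomain, which is where the sequential bottleneck noted above arises.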


5.7  Iterative  Schur  Complement  Algorithms 

A  number  of  approximations  have  been  investigated  in  Barth  et  al.  [2]  which 
simplify  the  exact  factorization  algorithm  and  address  the  growth  in  size  of  the 
Schur  complement.  During  this  investigation,  our  goal  has  been  to  develop  alge¬ 
braic  techniques  which  can  be  applied  to  both  elliptic  and  hyperbolic  partial  dif¬ 
ferential  equations.  These  approximations  include  iterative  (Krylov  projection) 
subdomain  and  Schur  complement  solves,  element  dropping  and  other  sparsity 
control  strategies,  localized  subdomain  solves  in  the  formation  of  the  Schur  com¬ 
plement,  and  partitioning  of  the  interface  and  parallel  distribution  of  the  Schur 
complement matrix. Before describing each approximation and technique, we
can  make  several  observations: 

Observation 1. (Ill-conditioning of Subproblems) For model elliptic problem
discretizations, it is known in the two subdomain case that
$\kappa(A_{\mathcal{D}_i\mathcal{D}_i}) = O((L/h)^2)$ and $\kappa(S) = O(L/h)$
where $L$ denotes the domain size. From this perspective, both subproblems are
ill-conditioned since the condition number depends on the mesh spacing parameter
$h$. If one considers the scalability experiment, the situation changes in a
subtle way. In the scalability experiment, the number of mesh points and the
number of subdomains are increased such that the ratio of subdomain size to mesh
spacing size $H/h$ is held constant. The subdomain matrices for elliptic problem
discretizations now exhibit a $O((H/h)^2)$ condition number so the cost
associated with iteratively solving them (with or without preconditioning) is
approximately constant as the problem size is increased. Therefore, this portion
of the algorithm is scalable. Even so, it may be desirable to precondition the
subdomain problems to reduce the overall cost. The Schur complement matrix
retains (at best) the $O(L/h)$ condition number and becomes increasingly
ill-conditioned as the mesh size is increased. Thus in the scalability
experiment, it is ill-conditioning of the Schur complement matrix that must be
controlled by adequate preconditioning, see for example Dryja, Smith and Widlund
[11].
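The two growth rates quoted in Observation 1 can be checked on a small model problem. The sketch below is a dense illustration under assumed settings: the 5-point Laplacian on an $n \times n$ grid with $n$ odd, split into two subdomains by the middle grid column. It compares the condition numbers of the subdomain block and of the Schur complement as the grid is refined. The function names are illustrative.

```python
import numpy as np

def laplacian_2d(n):
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return np.kron(np.eye(n), T) + np.kron(T, np.eye(n))

def two_subdomain_conditioning(n):
    """n odd: the middle grid column is the interface, the rest forms two subdomains."""
    A = laplacian_2d(n)
    col = np.arange(n * n) % n            # column index of each unknown
    itf = np.nonzero(col == n // 2)[0]
    dom = np.nonzero(col != n // 2)[0]
    ADD = A[np.ix_(dom, dom)]
    S = A[np.ix_(itf, itf)] - A[np.ix_(itf, dom)] @ np.linalg.solve(ADD, A[np.ix_(dom, itf)])
    return np.linalg.cond(ADD), np.linalg.cond(S)

for n in (7, 15):
    print(n, two_subdomain_conditioning(n))
```

Halving the mesh spacing roughly quadruples the subdomain condition number while the Schur complement condition number grows at a visibly slower rate.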

Observation  2.  (Non-stationary  Preconditioning)  The  use  of  Krylov  pro¬ 
jection  methods  to  solve  the  local  subdomain  and  Schur  complement  subprob¬ 
lems  renders  the  global  preconditioner  non-stationary.  Consequently,  Krylov  pro¬ 
jection  methods  designed  for  non-stationary  preconditioners  should  be  used  for 
the  global  problem.  For  this  reason,  FGMRES  [23],  a  variant  of  GMRES  designed 
for  non-stationary  preconditioning,  has  been  used  in  the  present  work. 

Observation  3.  (Algebraic  Coarse  Space)  The  Schur  complement  serves  as 
an  algebraic  coarse  space  operator  since  the  system 


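$$S x_{\mathcal{I}} = b_{\mathcal{I}} - A_{\mathcal{I}\mathcal{D}} A_{\mathcal{D}\mathcal{D}}^{-1} b_{\mathcal{D}} \qquad (19)$$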

globally  couples  solution  unknowns  on  the  entire  interface.  The  rapid  propaga¬ 
tion  of  information  to  large  distances  is  a  crucial  component  of  optimal  algo¬ 
rithms. 


5.8  ILU-GMRES  Subdomain  and  Schur  complement  Solves 

The  first  natural  approximation  is  to  replace  exact  inverses  of  the  subdomain 
and  Schur  complement  subproblems  with  an  iterative  Krylov  projection  method 
such  as  GMRES  (or  stabilized  biconjugate  gradient). 


Iterative Subdomain Solves. Recall from the exact factorization algorithm that
a subdomain solve is required once in the preprocessing step and twice in the
solution step. This suggests replacing these three inverses with $m_1$, $m_2$,
and $m_3$ steps of GMRES respectively. As mentioned in Observation 1, although
the condition number of subdomain problems remains roughly constant in the
scalability experiment, it is still beneficial to precondition subdomain problems
to improve the overall efficiency of the global preconditioner. By
preconditioning subdomain problems, the parameters $m_1$, $m_2$, $m_3$ can be
kept small. This will be exploited in later approximations. Since the subdomain
matrices are assumed given, it is straightforward to precondition subdomains
using ILU[$k$]. For the GLS spatial discretization, satisfactory performance is
achieved using ILU[2].
Iterative Schur complement Solves. It is possible to avoid explicitly computing
the Schur complement matrix for use in Krylov projection methods by alternatively
computing the action of $S$ on a given vector $p$, i.e.

$$S p = A_{\mathcal{I}\mathcal{I}}\, p - A_{\mathcal{I}\mathcal{D}} A_{\mathcal{D}\mathcal{D}}^{-1} A_{\mathcal{D}\mathcal{I}}\, p . \qquad (20)$$
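A matrix-free evaluation of the Schur complement action might look like the following sketch. The product is formed from right to left, so only one subdomain solve and three matrix-vector products are needed and $S$ itself is never assembled. The `solve` argument is a hypothetical hook, assumed here so that an inexact solver (for example, a few preconditioned GMRES steps) could be substituted for the exact dense solve.

```python
import numpy as np

def schur_action(A, dom, itf, p, solve=np.linalg.solve):
    """Return S p = A_II p - A_ID A_DD^{-1} (A_DI p) without assembling S."""
    ADD = A[np.ix_(dom, dom)]
    t = A[np.ix_(dom, itf)] @ p      # lift the interface vector into the subdomains
    t = solve(ADD, t)                # subdomain solve (exact here, may be inexact)
    return A[np.ix_(itf, itf)] @ p - A[np.ix_(itf, dom)] @ t

# Small check against the explicitly assembled Schur complement.
n = 9
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
itf = [2, 5]
dom = [i for i in range(n) if i not in itf]
p = np.array([1.0, -2.0])
S = A[np.ix_(itf, itf)] - A[np.ix_(itf, dom)] @ np.linalg.solve(
    A[np.ix_(dom, dom)], A[np.ix_(dom, itf)])
print(np.allclose(schur_action(A, dom, itf, p), S @ p))
```

When the inner solve is only approximate, the resulting operator changes from one application to the next, which is why a flexible outer Krylov method is required.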


Unfortunately  S  is  ill-conditioned,  thus  some  form  of  interface  preconditioning  is 
needed.  For  elliptic  problems,  the  rapid  decay  of  elements  away  from  the  diagonal 
in  the  Schur  complement  matrix  [16]  permits  simple  preconditioning  techniques. 
Bramble,  Pasciak,  and  Schatz  [4]  have  shown  that  even  the  simple  block  Jacobi 
preconditioner  yields  a  substantial  improvement  in  condition  number 

$$\kappa(S P_S^{-1}) \le C H^{-2} \left( 1 + \log^2(H/h) \right) \qquad (21)$$


for $C$ independent of $h$ and $H$. For a small number of subdomains, this
technique is very effective. To avoid the explicit formation of the diagonal
blocks, a number of simplified approximations have been introduced over the last
several years, see for example Bjorstad [3] or Smith et al. [26]. By introducing
a further coarse space coupling of cross points to the interface, the condition
number is further improved

$$\kappa(S P_S^{-1}) \le C \left( 1 + \log^2(H/h) \right) . \qquad (22)$$

Unfortunately,  the  Schur  complement  associated  with  advection  dominated  dis¬ 
cretizations  may  not  exhibit  the  rapid  element  decay  found  in  the  elliptic  case. 
This  can  occur  when  characteristic  trajectories  of  the  advection  equation  tra¬ 
verse  a  subdomain  from  one  interface  edge  to  another.  Consequently,  the  Schur 
complement is not well-preconditioned by elliptic-like preconditioners that use
the  action  of  local  problems.  A  more  basic  strategy  has  been  developed  in  the 
present  work  whereby  elements  of  the  Schur  complement  are  explicitly  computed. 
Once  the  elements  have  been  computed,  ILU  factorization  is  used  to  precondition 
the  Schur  complement  iterative  solution.  In  principle,  ILU  factorization  with  a 
suitable  reordering  of  unknowns  can  compute  the  long  distance  interactions  asso¬ 
ciated  with  simple  advection  fields  corresponding  to  entrance/exit-like  flows.  For 
general  advection  fields,  it  remains  a  topic  of  current  research  to  find  reordering 
algorithms  suitable  for  ILU  factorization.  The  situation  is  further  complicated 
for  coupled  systems  of  hyperbolic  equations  (even  in  two  independent  variables) 
where  multiple  characteristic  directions  and/or  Cauchy-Riemann  systems  can 
be  produced.  At  the  present  time,  Cuthill-McKee  ordering  has  been  used  on  all 
matrices  although  improved  reordering  algorithms  are  currently  under  develop¬ 
ment. 
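The bandwidth-reducing effect of the Cuthill-McKee ordering mentioned above can be illustrated with a minimal sketch. This is a textbook breadth-first variant written for illustration, not the implementation used in the paper; the function names `cuthill_mckee` and `bandwidth` and the small path graph are assumptions made for this example.

```python
from collections import deque

def cuthill_mckee(adj):
    """Return a Cuthill-McKee ordering of an undirected graph.

    adj maps each vertex to the set of its neighbours.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    visited = set()
    order = []
    # Handle every connected component, starting from a minimum-degree vertex.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Number the unvisited neighbours in order of increasing degree.
            for w in sorted(adj[v] - visited, key=lambda u: (degree[u], u)):
                visited.add(w)
                queue.append(w)
    return order

def bandwidth(adj, order):
    """Maximum |i - j| over edges (i, j) after renumbering by `order`."""
    pos = {v: k for k, v in enumerate(order)}
    return max((abs(pos[v] - pos[w]) for v in adj for w in adj[v]), default=0)

# Hypothetical example: a path graph stored under a scrambled numbering.
labels = [3, 0, 5, 1, 4, 2]
adj = {v: set() for v in labels}
for a, b in zip(labels, labels[1:]):
    adj[a].add(b)
    adj[b].add(a)

natural = sorted(adj)
cm = cuthill_mckee(adj)
print(bandwidth(adj, natural), bandwidth(adj, cm))
```

A small bandwidth keeps the fill of an incomplete factorization close to the diagonal, which is one reason such orderings are commonly paired with ILU.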

In  the  present  implementation,  each  subdomain  processor  computes  (in  par¬ 
allel)  and  stores  portions  of  the  Schur  complement  matrix 


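$$ A_{\mathcal{I}_i \mathcal{I}_i} - A_{\mathcal{I}_i \mathcal{D}_i} A_{\mathcal{D}_i \mathcal{D}_i}^{-1} A_{\mathcal{D}_i \mathcal{I}_i} \qquad (23) $$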


Fig. 10. Interface (bold lines) decomposed into 4 subinterfaces indicated by
alternating shaded regions.

To gain improved parallel scalability, the interface edges and cross points are
partitioned into a smaller number of generic "subinterfaces". This subinterface
partitioning is accomplished by assigning a supernode to each interface edge
separating two subdomains, forming the graph of the Schur complement matrix
in terms of these supernodes, and applying the METIS partitioning software to
this graph. Let $\mathcal{I}_j$ denote the j-th subinterface such that
$\mathcal{I} = \cup_j \mathcal{I}_j$. Computation of the action of the Schur
complement matrix on a vector p needed in Schur complement solves now takes
the (highly parallel) form

$$ S p = \sum_{j=1}^{\#\mathrm{subinterfaces}} \; \sum_{i=1}^{\#\mathrm{subdomains}} \cdots \qquad (24) $$

Using this formula it is straightforward to compute the action of S on a vector
p to any required accuracy by choosing the subdomain iteration parameter
$m_i$ large enough. Figure 10 shows an interface and the immediate neighboring
mesh that has been decomposed into 4 smaller subinterface partitions for a 32
subdomain partitioning. By choosing the number of subinterface partitions
proportional to the square root of the number of 2-D subdomains and assigning a
processor to each, the number of solution unknowns associated with each
subinterface is held approximately constant in the scalability experiment. Note that
the use of iterative subdomain solves renders both Eqns. (20) and (24)
approximate.
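The way the inner iteration count controls the accuracy of the Schur complement action can be sketched on a tiny model problem. The sketch below assumes a symmetric positive definite model (a 1-D Laplacian with one interface unknown and two subdomains) and uses a few conjugate gradient steps as a stand-in for the ILU-GMRES subdomain solves of the paper; the names `cg`, `schur_action`, and the block matrices are hypothetical.

```python
def matvec(M, x):
    """Dense matrix-vector product for lists of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def cg(A, b, m):
    """m steps of conjugate gradients for SPD A, starting from zero."""
    x = [0.0] * len(b)
    r = list(b)
    d = list(r)
    rr = sum(v * v for v in r)
    for _ in range(m):
        if rr == 0.0:
            break
        Ad = matvec(A, d)
        alpha = rr / sum(u * v for u, v in zip(d, Ad))
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ad)]
        rr_new = sum(v * v for v in r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return x

def schur_action(A_II, subdomains, p, m):
    """Approximate S p = A_II p - sum_i A_ID A_DD^{-1} A_DI p."""
    result = matvec(A_II, p)
    for A_ID, A_DD, A_DI in subdomains:
        y = cg(A_DD, matvec(A_DI, p), m)      # inexact subdomain solve
        correction = matvec(A_ID, y)
        result = [r - c for r, c in zip(result, correction)]
    return result

# tridiag(-1, 2, -1) on 5 points; the middle point is the interface.
A_II = [[2.0]]
A_DD = [[2.0, -1.0], [-1.0, 2.0]]
subdomains = [
    ([[0.0, -1.0]], A_DD, [[0.0], [-1.0]]),   # left subdomain
    ([[-1.0, 0.0]], A_DD, [[-1.0], [0.0]]),   # right subdomain
]

for m in (1, 2):
    print(m, schur_action(A_II, subdomains, [1.0], m))
```

For this model the exact Schur complement is the scalar 2/3; one inner step gives a visibly different value, while two steps recover it to rounding error because each subdomain block is only 2 x 2.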

In our investigation, the Schur complement is preconditioned using ILU
factorization. This is not a straightforward task for two reasons: (1) portions of
the Schur complement are distributed among subdomain processors, (2) the
interface itself has been distributed among several subinterface processors. In the
next section, a block element dropping strategy is proposed for gathering
portions of the Schur complement together on subinterface processors for use in
ILU preconditioning the Schur complement solve. Thus, a block Jacobi
preconditioner is constructed for the Schur complement which is more powerful than
the Bramble, Pasciak, and Schatz (BPS) form (without coarse space correction)
since the blocks now correspond to larger subinterfaces rather than the smaller
interface edges. Formally, BPS preconditioning without coarse space correction
can be obtained for 2D elliptic discretizations by dropping additional terms in
our Schur complement matrix approximation and ordering unknowns along
interface edges so that the ILU factorization of the tridiagonal-like system for each
interface edge becomes exact.

Block  Element  Dropping  In  our  implementation,  portions  of  the  Schur  com¬ 
plement  residing  on  subdomain  processors  are  gathered  together  on  subinterface 
processors  for  use  in  ILU  preconditioning  of  the  Schur  complement  solve.  In 
assembling  a  Schur  complement  matrix  approximation  on  each  subinterface  pro¬ 
cessor,  certain  matrix  elements  are  neglected: 

1.  All  elements  that  couple  subinterfaces  are  ignored.  This  yields  a  block  Jacobi 
approximation  for  subinterfaces. 

2. All elements whose matrix entry location exceeds a user-specified graph
distance from the diagonal, as measured on the triangulation graph, are
ignored. Recall that the Schur complement matrix can be very dense. The
graph distance criterion is motivated by the rapid decay of elements away
from the matrix diagonal for elliptic problems. In all subsequent
calculations, a graph distance threshold of 2 has been chosen for block element
dropping.
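The two dropping rules above can be sketched as a filter over the entries of a Schur complement stored as a dictionary. This is a minimal illustration under stated assumptions: the mesh graph `adj`, the subinterface map `owner`, the dense test matrix `S`, and the function names are all hypothetical and chosen only to make the rules concrete.

```python
from collections import deque

def graph_distances(adj, source, max_dist):
    """Breadth-first distances from `source`, truncated at `max_dist`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] == max_dist:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def drop_entries(S, adj, owner, max_dist):
    """Keep S[i, j] only if i and j belong to the same subinterface (rule 1)
    and lie within `max_dist` edges of each other on the mesh graph (rule 2)."""
    kept = {}
    near = {}
    for (row, col), value in S.items():
        if owner[row] != owner[col]:
            continue
        if row not in near:
            near[row] = graph_distances(adj, row, max_dist)
        if col in near[row]:
            kept[(row, col)] = value
    return kept

# Hypothetical data: six interface vertices on a chain, two subinterfaces.
n = 6
adj = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}
owner = {i: 0 if i < 3 else 1 for i in range(n)}
S = {(i, j): 1.0 / (1 + abs(i - j)) for i in range(n) for j in range(n)}

kept = drop_entries(S, adj, owner, max_dist=2)
print(len(S), len(kept))
```

With a threshold of 2, as used in the text, the dense 6 x 6 test matrix reduces to two uncoupled 3 x 3 diagonal blocks.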

Figures 11(a) and 11(b) show calculations performed with the present
non-overlapping domain-decomposition preconditioner for diffusion and advection
problems. These figures graph the number of global FGMRES iterations needed
to solve the discretization matrix problem to 10“® accuracy tolerance as a
function of the number of subproblem iterations. In this example, all the subproblem
iteration parameters have been set equal to each other ($m_1 = m_2 = m_3$). The
horizontal lines show poor scalability of single domain ILU-FGMRES on meshes
containing 2500, 10000, and 40000 solution unknowns. The remaining curves
show the behavior of the Schur complement preconditioned FGMRES on 4, 16,
and 64 subdomain meshes. Satisfactory scalability for very small values (5 or 6)
of the subproblem iteration parameter $m_i$ is clearly observed.


Wireframe Approximation  A major cost in the explicit construction of the
Schur complement is the matrix-matrix product

$$ A_{\mathcal{D}_i \mathcal{D}_i}^{-1} A_{\mathcal{D}_i \mathcal{I}_i}. \qquad (25) $$

Since the subdomain inverse is computed iteratively using ILU-GMRES
iteration, forming (25) is equivalent to solving a multiple right-hand side system with
each right-hand side vector corresponding to a column of
$A_{\mathcal{D}_i \mathcal{I}_i}$. The number of columns of
$A_{\mathcal{D}_i \mathcal{I}_i}$ is precisely the number of solution unknowns
located on the interface surrounding a subdomain. This computational cost can be
quite large. Numerical experiments with Krylov projection methods designed for
multiple right-hand side systems [25] showed only marginal improvement owing to
the fact that the columns are essentially independent. In the following paragraphs,
"wireframe" and "supersparse" approximations are introduced to reduce the cost
in forming the Schur complement matrix.

Fig. 11. Effect of the subproblem iteration parameters $m_i$ on the global FGMRES
convergence, $m_1 = m_2 = m_3$, for meshes containing 2500, 10000, and 40000
solution unknowns. (a) Diffusion dominated problem, $u_{xx} + u_{yy} = 0$.
(b) Advection dominated problem, $u_x + u_y = 0$.

The wireframe approximation idea [9] is motivated from standard elliptic
domain-decomposition theory by the rapid decay of elements in S with graph
distance from the diagonal. Consider constructing a relatively thin wireframe
region surrounding the interface as shown in Fig. 12(a). In forming the Eqn.
(25) expression, subdomain solves are performed using the much smaller
wireframe subdomains. In matrix terms, a principal submatrix of A, corresponding to
the variables within the wireframe, is used to compute the (approximate) Schur
complement of the interface variables. It is known from domain-decomposition
theory that the exact Schur complement of the wireframe region is spectrally
equivalent to the Schur complement of the whole domain. This wireframe
approximation leads to a substantial savings in the computation of the Schur
complement matrix. Note that the full subdomain matrices are used everywhere else
in the Schur complement algorithm. The wireframe technique introduces a new
adjustable parameter into the preconditioner which represents the width of the
wireframe. For simplicity, this width is specified in terms of graph distance on the
mesh triangulation. Figure 12(b) demonstrates the performance of this
approximation by graphing the total number of preconditioned FGMRES iterations

Fig. 12. Wireframe region surrounding interface and preconditioner performance
results for a fixed mesh size (1600 vertices) and 16 subdomain partitioning.
(a) Wireframe region surrounding interface. (b) Effect of wireframe support on
preconditioner performance for diffusion $u_{xx} + u_{yy} = 0$ and advection
$u_x + u_y = 0$ problems.


required  to  solve  the  global  matrix  problem  to  a  10  ®  accuracy  tolerance  while 
varying  the  width  of  the  wireframe.  As  expected,  the  quality  of  the  precondi¬ 
tioner  improves  rapidly  with  increasing  wireframe  width  with  full  subdomain-like 
results obtained using modest wireframe widths. As a consequence of the
wireframe construction, the time taken to form the Schur complement has dropped by
approximately 50%.
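Selecting a wireframe of a given graph-distance width amounts to a multi-source breadth-first search started from the interface vertices. The sketch below shows this on a hypothetical 7 x 7 structured grid (the paper works on triangulations); the function name `wireframe_vertices` and the grid are assumptions made for illustration.

```python
from collections import deque

def wireframe_vertices(adj, interface, width):
    """Vertices within `width` edges of any interface vertex."""
    dist = {v: 0 for v in interface}
    queue = deque(interface)
    while queue:
        v = queue.popleft()
        if dist[v] == width:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)

# Hypothetical 7 x 7 grid with the interface along the column x = 3.
n = 7
adj = {(x, y): {(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < n and 0 <= y + dy < n}
       for x in range(n) for y in range(n)}
interface = [(3, y) for y in range(n)]

thin = wireframe_vertices(adj, interface, width=1)
print(len(thin), "of", len(adj), "vertices kept")
```

The principal submatrix of A restricted to the returned vertex set is what the wireframe approximation would use in place of the full subdomain matrices when forming (25).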


Supersparse Matrix-Vector Operations  It is possible to introduce
further approximations which improve upon the overall efficiency in forming the
Schur complement matrix. One simple idea is to exploit the extreme sparsity in
columns of $A_{\mathcal{D}_i \mathcal{I}_i}$ or equivalently the sparsity in the
right-hand sides produced from
$A_{\mathcal{D}_i \mathcal{D}_i}^{-1} A_{\mathcal{D}_i \mathcal{I}_i}$ needed in the
formation of the Schur complement. Observe that m steps of GMRES generate a small
sequence of Krylov subspace vectors $[p, A p, A^2 p, \ldots, A^m p]$ where p is a
right-hand side vector. Consequently for small m, if both A and p are sparse then
the sequence of matrix-vector products will be relatively sparse. Standard sparse
matrix-vector product subroutines utilize the matrix in sparse storage format and
the vector in dense storage format. In the present application, the vectors contain
only a few non-zero entries so that standard sparse matrix-vector products waste
many arithmetic operations. For this reason, a "supersparse" software library has
been developed to take advantage of the sparsity in matrices as well as in vectors
by storing both in compressed form. Unfortunately, when GMRES is preconditioned
using ILU factorization, the Krylov sequence becomes
$[p, A P^{-1} p, (A P^{-1})^2 p, \ldots, (A P^{-1})^m p]$. Since the
inverse of the ILU approximate factors L and U can be dense, the first application
of ILU preconditioning produces a dense Krylov vector result. All subsequent
Krylov vectors can become dense as well. To prevent this densification of vectors
using ILU preconditioning, a fill-level-like strategy has been incorporated into
the ILU backsolve step. Consider the ILU preconditioning problem, $L U r = b$.
This system is conventionally solved by a lower triangular backsolve, $w = L^{-1} b$,
followed by an upper triangular backsolve, $r = U^{-1} w$. In our supersparse strategy,
sparsity is controlled by imposing a non-zero fill pattern for the vectors w and
r during lower and upper backsolves. The backsolve fill patterns are most easily
specified in terms of fill-level distance, i.e. the graph distance from existing
nonzeros of the right-hand side vector within which new fill in the resultant vector
is allowed to occur. This idea is motivated by the element decay phenomenon
observed for elliptic problems. Table 1 shows the performance benefits of using
supersparse computations together with backsolve fill-level specification for a 2-D
test problem consisting of Euler flow past a multi-element airfoil geometry
partitioned into 4 subdomains with 1600 mesh vertices in each subdomain. Computations


Table 1. Performance of the Schur complement preconditioner with supersparse
arithmetic for a 2-D test problem consisting of Euler flow past a multi-element
airfoil geometry partitioned into 4 subdomains with 1600 mesh vertices in each
subdomain.

Backsolve fill-level   Global GMRES   Time(k)/Time(∞)
distance k             iterations
--------------------   ------------   ---------------
0                      26             0.325
1                      22             0.313
2                      21             0.337
3                      20             0.362
4                      20             0.392
∞                      20             1.000

were performed on the IBM SP2 parallel computer using MPI message passing
protocol. Various values of backsolve fill-level distance were chosen while
monitoring the number of global GMRES iterations needed to solve the matrix
problem and the time taken to form the Schur complement preconditioner.
Results for this problem indicate preconditioning performance comparable to exact
ILU backsolves using backsolve fill-level distances of only 2 or 3 with a 60-70%
reduction in cost.
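The two ingredients of the supersparse idea, a product that touches only the nonzeros of a compressed vector and a triangular solve restricted to a prescribed fill pattern, can be sketched as follows. This is an illustration under stated assumptions rather than the library described in the text: the dictionary storage scheme, the function names, and the small bidiagonal test factor are all hypothetical.

```python
def sparse_matvec(cols, x):
    """y = A x with A stored by columns {j: {i: a_ij}} and x as {j: x_j}.

    Only the columns that match nonzeros of x are touched.
    """
    y = {}
    for j, xj in x.items():
        for i, aij in cols.get(j, {}).items():
            y[i] = y.get(i, 0.0) + aij * xj
    return y

def allowed_pattern(adj, b, k):
    """Indices within graph distance k of the nonzeros of b."""
    pattern = set(b)
    frontier = set(b)
    for _ in range(k):
        frontier = {w for v in frontier for w in adj[v]} - pattern
        pattern |= frontier
    return pattern

def restricted_lower_solve(rows, diag, b, pattern):
    """Forward substitution for L w = b, computing w only on `pattern`.

    rows[i] holds the strictly lower entries {j: l_ij} of row i;
    entries of w outside the pattern are treated as zero.
    """
    w = {}
    for i in sorted(pattern):
        s = b.get(i, 0.0)
        for j, lij in rows.get(i, {}).items():
            if j in w:
                s -= lij * w[j]
        w[i] = s / diag[i]
    return w

# Hypothetical 6 x 6 data on a chain graph.
n = 6
adj = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}
cols = {j: {i: (2.0 if i == j else -1.0) for i in (j - 1, j, j + 1) if 0 <= i < n}
        for j in range(n)}
rows = {i: {i - 1: -0.5} for i in range(1, n)}   # unit lower bidiagonal factor
diag = {i: 1.0 for i in range(n)}
b = {0: 1.0}

y = sparse_matvec(cols, {2: 1.0})
w_full = restricted_lower_solve(rows, diag, b, allowed_pattern(adj, b, n))
w_k2 = restricted_lower_solve(rows, diag, b, allowed_pattern(adj, b, 2))
print(y, w_k2, len(w_full))
```

In this example the unrestricted solve fills all six entries with geometrically decaying values, while a fill-level distance of 2 keeps only the three largest, which mirrors the decay argument used in the text.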

6  Numerical  Results  on  the  IBM  SP2 

In the remaining paragraphs, we assess the performance of the Schur
complement preconditioned FGMRES in solving linear matrix problems associated with an
approximate Newton method for the nonlinear discretized compressible
Euler equations. All calculations were performed on an IBM SP2 parallel
computer using MPI message passing protocol. A scalability experiment was
performed on meshes containing 4/1, 16/2, and 64/4 subdomains/subinterfaces
with each subdomain containing 5000 mesh elements. Figures 13(a) and 13(b)


Fig. 13. Mach number contours and mesh partition boundaries for NACA0012 airfoil
geometry. (a) Mach contours (4 subdomains, 20K elements). (b) Mach contours
(16 subdomains, 80K elements).


show mesh partitionings and sample Mach number solution contours for subsonic
($M_\infty = 0.20$, $\alpha = 2.0^\circ$) flow over the airfoil geometry. The flow
field was computed using the stabilized GLS discretization and approximate Newton
method described in Sect. 3. Figure 14 graphs the convergence of the approximate
Newton method for the 16 subdomain test problem. Each approximate Newton
iterate shown in Fig. 14 requires the solution of a linear matrix system which has
been solved using the Schur complement preconditioned FGMRES algorithm.
Figure 15 graphs the convergence of the FGMRES algorithm for each matrix
from the 4 and 16 subdomain test problems. These calculations were performed
using ILU[2] and $m_1 = m_2 = m_3 = 5$ iterations on subproblems with
supersparse distance equal to 5. The 4 subdomain mesh with 20000 total elements
produces matrices that are easily solved in 9-17 global FGMRES iterations.
Calculations corresponding to the largest CFL numbers are close approximations to
exact Newton iterates. As is typically observed with these methods, the final few
Newton iterates are solved more easily than matrices produced during earlier
iterates. The most difficult matrix problem required 17 FGMRES iterations and
the final Newton iterate required only 12 FGMRES iterations. The 16
subdomain mesh containing 80000 total elements produces matrices that are solved
in 12-32 global FGMRES iterations. Due to the nonlinearity in the spatial
discretization, several approximate Newton iterates were relatively difficult to
solve, requiring over 30 FGMRES iterations. As nonlinear convergence is obtained
the matrix problems become less demanding. In this case, the final Newton iterate
matrix

Fig. 14. Nonlinear convergence behavior of the approximate Newton method for
subsonic airfoil flow.

Fig. 15. FGMRES convergence history for each Newton step. (a) 4 subdomains
(20K elements). (b) 16 subdomains (80K elements). Horizontal axis in both panels:
Global FGMRES Matrix-Vector Products.


required 22 FGMRES iterations. This iteration degradation from the 4
subdomain case can be reduced by increasing the subproblem iteration parameters
$m_1$, $m_2$, $m_3$, but the overall computation time is increased. In the remaining
timing graphs, we have sampled timings from 15 FGMRES iterations taken from the
final Newton iterate on each mesh. For example, Fig. 16(a) gives a raw timing
breakdown for several of the major calculations in the overall solver:
calculation of the Schur complement matrix, preconditioning FGMRES with the Schur
complement algorithm, matrix element computation and assembly, and FGMRES
solve. Results are plotted on each of the meshes containing 4, 16, and 64
subdomains with 5000 elements per subdomain. Since the number of elements
in each subdomain is held constant, the time taken to assemble the matrix is

Fig. 16. Raw IBM SP2 timing breakdown and the effect of increased number of
subdomains on smallest and largest interface sizes. (a) Raw timing breakdown.
(b) Interface imbalance and growth. Horizontal axis in both panels: Number of
Subdomains.


also constant. Observe that in our implementation the time to form and apply
the Schur complement preconditioner currently dominates the calculation.
Although the growth observed in these timings with increasing numbers of
subdomains comes from several sources, the dominant effect comes from a very simple
source: the maximum interface size growth associated with subdomains. This has
a devastating impact on the parallel performance since at the Schur complement
synchronization point all processors must wait for subdomains working on the
largest interfaces to finish. Figure 16(b) plots this growth in maximum interface
size as a function of number of subdomains in our scalability experiment.
Although the number of elements in each subdomain has been held constant in
this experiment, the largest interface associated with any subdomain has more
than doubled. This essentially translates into a doubling in time to form the
Schur complement matrix. This doubling in time is clearly observed in the raw
timing breakdown in Fig. 16(a). At this point in time, we know of no
partitioning method that actively addresses controlling the maximum interface size
associated with subdomains. We suspect that other non-overlapping methods
are sensitive to this effect as well.
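Because all processors wait at the synchronization point for the subdomain with the largest interface, the ratio of the largest to the mean interface size gives a rough measure of the lost efficiency. The short sketch below computes this ratio; the function name `imbalance` and the interface sizes are hypothetical and are not measurements from the experiments reported here.

```python
def imbalance(interface_sizes):
    """Ratio of the largest to the mean interface size (1.0 means balanced)."""
    mean = sum(interface_sizes) / len(interface_sizes)
    return max(interface_sizes) / mean

# Hypothetical interface sizes (unknowns per subdomain interface).
balanced = [100, 100, 100, 100]
skewed = [100, 100, 100, 300]

print(imbalance(balanced), imbalance(skewed))
```

If the time to form a subdomain's portion of the Schur complement is roughly proportional to its interface size, a ratio of 2 means that the synchronized step takes about twice as long as it would under perfect balance.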


7  Concluding  Remarks 

Experience with our non-overlapping domain-decomposition method with an
algebraically generated coarse problem shows that we can successfully trade off
some of the robustness of the exact Schur complement method for increased
efficiency by making appropriately designed approximations. In particular, the
localized wireframe approximation and the supersparse matrix-vector operations
together result in reduced cost without significantly degrading the overall
convergence rate.

It remains an outstanding problem to partition domains such that the
maximum interface size does not grow with increased number of subdomains and mesh
size. In addition, it may be cost effective to combine this technique with multigrid
or multiple-grid techniques to improve the robustness of Newton's method.

Abstract. We describe our parallel implementation for large-eddy
simulation and direct numerical simulation of turbulent fluids (called
ParDistuf) based on the three-dimensional incompressible Navier-Stokes
equations. Benchmark results on a set of European supercomputers under
the message-passing platform MPI are presented. Using this program
on a 48-node SP-2 we resolved the inertial subrange of Kolmogorov's
turbulence spectra for the first time for a stratified and sheared
environmental flow.


1  Introduction 

The  Institute  of  Atmospheric  Physics  at  the  German  Aerospace  Research  Facil¬ 
ity  (DLR)  in  Oberpfaffenhofen  is  investigating  the  physics  of  turbulent  fluids. 
The  studies  are  motivated  by  the  need  to  understand  and  predict  the  diffusion 
of  species  concentrations  in  atmospheric  flows  which  often  are  turbulent,  stably 
stratified  and  sheared.  One  special  point  of  interest  is  the  concern  that  exhaust 
gases  from  aircraft  may  influence  the  global  climate.  The  aim  of  the  paralleliza¬ 
tion  activities  is  to  tackle  the  particular  problem  of  the  diffusion  properties  at 
small  scales. 

During the last ten years an extensive program for Direct numerical (or
large-eddy) Simulation of TUrbulent Fluid (Distuf) at high Reynolds numbers under
the influence of shear and stable stratification was developed and optimized for
daily use on vector computers such as the Cray Y-MP (cf. [3]). For a simulation
run on $128^3$ gridpoints with 3000 time steps Distuf requires about 44 MWords
(64-bit) of memory and eight hours of CPU time on one processor.

To resolve the inertial subrange of the turbulent energy spectrum the
resolution has to be increased to $512^3$ gridpoints. This is not feasible on today's
vector computers. At this point we decided to make our code suitable for
state-of-the-art massively parallel systems to access more computing power and more
memory. The parallelization is based on the concepts of message-passing and
domain decomposition. We use MPI to achieve high portability.

The successful parallelization is demonstrated by a simulation run on a 48
processor SP-2 at the DLR. In this run the inertial subrange of Kolmogorov's spectra
is resolved for the first time by a computer simulation.

* Supported by Zentrum für Paralleles Rechnen (University of Cologne)
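The memory figures quoted in this introduction explain why the finer grid is out of reach for a single vector processor. The estimate below is a rough sketch that assumes memory grows linearly with the number of gridpoints and that one MWord means 10^6 words of 8 bytes; both are assumptions of this example, not statements from the text.

```python
words_128 = 44e6              # memory quoted for the 128^3 run, in 64-bit words
points_128 = 128 ** 3
points_512 = 512 ** 3

# Assumed linear scaling of memory with the number of gridpoints.
words_per_point = words_128 / points_128
words_512 = words_per_point * points_512
gbytes_512 = words_512 * 8 / 1e9

print(round(words_per_point, 1), round(words_512 / 1e6), round(gbytes_512, 1))
```

Under these assumptions the finer grid needs 64 times the memory of the coarse one, which is the motivation for distributing the data over many processors.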

2  Method  of  simulation 

We  integrate  the  time-dependent,  incompressible,  three-dimensional  Navier  - 
Stokes  and  temperature  concentration  equations  in  a  rectangular  domain  and  in 
time.  The  methods  of  large-eddy  simulation  (LES)  or  direct  numerical  simula¬ 
tion  (DNS)  are  selectable. 

We consider a rectangular domain with coordinates x, y, z or $x_i$
($i \in \{1,2,3\}$) and side-length $L_i$. The mean horizontal velocity $U_0(z)$ and
mean (reference) temperature $T_0(z)$ possess uniform and constant gradients
relative to the vertical coordinate z and are constant in the other directions. The
fluid is assumed to have constant molecular diffusivities for momentum and heat. All
fields are expressed nondimensionally using $L := L_3$, $\Delta U = |dU_0/dx_3| L$ and
$\Delta T = |dT_0/dx_3| L$ as reference scales for length, velocity and temperature.
The turbulent fluctuations relative to these mean values are $u_i$
($i \in \{1,2,3\}$) for velocity, T for temperature and p for pressure.

Fig. 1. (a) Simulation domain and (b) mean profiles of velocity and temperature

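The normalized Navier-Stokes equation, the heat balance, and the continuity
equation read

$$ \frac{\partial u_i}{\partial t} + \frac{\partial}{\partial x_j}\left(u_j u_i + \tau_{ij}\right) + S x_3 \frac{\partial u_i}{\partial x_1} + S u_3 \delta_{i1} = -\frac{\partial p}{\partial x_i} + Ri\, T \delta_{i3}, \qquad (1) $$

$$ \frac{\partial T}{\partial t} + \frac{\partial}{\partial x_j}\left(u_j T + q_j\right) + S x_3 \frac{\partial T}{\partial x_1} + s u_3 = 0, \qquad (2) $$

$$ \frac{\partial u_j}{\partial x_j} = 0, \qquad (3) $$

where $S = (L/\Delta U)(dU_0/dx_3) \in \{0, 1\}$ is the nondimensional shear parameter,
$s = (L/\Delta T)(dT_0/dx_3) \in \{-1, 0, 1\}$ is the stratification parameter, and
$\delta_{ij}$ is the Kronecker delta. Ri is the gradient Richardson number,
$\tau_{ij}$ and $q_j$ denote the diffusive fluxes of momentum and heat. Additionally
the distribution of three passive scalars can be calculated by solving transport
equations for each one.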
The equations are discretized in an equidistant Eulerian framework using a
second-order finite-difference technique on a staggered grid for all the terms
in the equations except the mean advection, where pseudo-spectral (Fourier)
approximation in x-direction is used. The Adams-Bashforth scheme is employed
for time integration of the acceleration terms. The pressure at the new
time-level n + 1 is obtained by solving the Poisson equation in finite difference
form (4), with $\delta_i$ as the common finite difference operator and $U_i$ denoting
the velocity terms resulting from the Adams-Bashforth scheme.

(4) 

The  solution  of  (4)  is  obtained  using  a  fast  Poisson  solver,  which  includes  the 
shear-periodic  boundary  condition  at  time  and  applies  a  combination  of  fast 
Fourier  transforms  and  Gaussian  elimination.  Finally  the  velocities  are  updated 
by  the  new  pressure  terms. 
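The explicit time integration can be illustrated with a minimal sketch of an Adams-Bashforth step. The text does not state the order of the scheme; the second-order variant is assumed here, started with one forward Euler step, and applied to the scalar test equation du/dt = -u rather than to the flow equations. The function name `ab2_integrate` is hypothetical.

```python
import math

def ab2_integrate(f, u0, dt, steps):
    """Integrate du/dt = f(u) with second-order Adams-Bashforth.

    The first step uses forward Euler because no earlier tendency exists.
    """
    u = u0
    f_old = f(u)
    u = u + dt * f_old
    for _ in range(steps - 1):
        f_new = f(u)
        # Extrapolate the tendency from the two most recent time levels.
        u = u + dt * (1.5 * f_new - 0.5 * f_old)
        f_old = f_new
    return u

approx = ab2_integrate(lambda u: -u, 1.0, 0.01, 100)
print(approx, math.exp(-1.0))
```

The scheme needs only one new evaluation of the right-hand side per step, which is attractive when that evaluation involves the full set of discretized advection and diffusion terms.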


3  Parallel  approach 

Our  main  aim  was  to  develop  a  parallel  code  which  is  not  only  efficient,  scalable, 
and  numerically  correct,  but  also  portable  on  a  wide  range  of  supercomputers. 
For this reason we decided to take advantage of the MPI message passing
standard. Our first experiences with the MPI implementations show that this was
the right decision. We were able to port the program to a wide set of computers
very quickly and without any changes to the message passing calls.


3.1  Domain  decomposition 

Analysing the sequential code we found that, with respect to parallelism,
the one-, two- and three-dimensional fast Fourier transforms (FFT) are the most
critical parts of the program. The one-dimensional FFT in x-direction is used in
every time step in order to account for the shear flow in the horizontal velocity
components. The two-dimensional FFT in x- and y-direction is a part of the Poisson
solver for equation (4). For statistical evaluations in Fourier space, which can
be done at user-defined intervals, we need a parallel three-dimensional Fourier
transform, too.

The best way of implementing two- and three-dimensional FFTs on parallel
computers is to treat them as a sequence of one-dimensional transforms, which are
computed independently on the processors (cf. [1],[8],[2]). This is no problem at
all on shared memory computers and can be implemented on distributed
memory machines by using efficient algorithms for data transposition. For this reason
we partition the three-dimensional grid into subdomains that allow a
one-dimensional Fourier transform in x-direction to be performed without
communication. That means that the domain is decomposed into horizontal bars in
x-direction (two-dimensional decomposition) or into horizontal (or vertical) planes
in x/y (or x/z) direction, which we call one-dimensional decomposition. Accordingly,
the process topology is a one-dimensional or a two-dimensional grid (fig. 2). With
the described domain partitioning we can perform all necessary Fourier
transforms in y and z direction after data transposition, which can be implemented
very efficiently using MPI calls (cf. [6]).
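The idea of computing a multi-dimensional transform as a sequence of local one-dimensional transforms separated by a transposition can be sketched serially. The sketch below uses a naive discrete Fourier transform in place of an FFT and a plain list transposition in place of the MPI data exchange; both substitutions, and all function names, are assumptions made to keep the example self-contained.

```python
import cmath

def dft(x):
    """Naive one-dimensional discrete Fourier transform."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dft2_by_transposition(a):
    """2-D DFT as row transforms, a transposition, and row transforms again."""
    rows_done = [dft(row) for row in a]          # local 1-D transforms
    swapped = transpose(rows_done)               # stands in for the data exchange
    cols_done = [dft(row) for row in swapped]    # local 1-D transforms again
    return transpose(cols_done)

def dft2_direct(a):
    """Reference 2-D DFT evaluated directly from the definition."""
    n, m = len(a), len(a[0])
    return [[sum(a[r][c] * cmath.exp(-2j * cmath.pi * (p * r / n + q * c / m))
                 for r in range(n) for c in range(m))
             for q in range(m)] for p in range(n)]

a = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0], [2.0, 0.0, 2.0, 0.0]]
fast = dft2_by_transposition(a)
ref = dft2_direct(a)
err = max(abs(fast[p][q] - ref[p][q]) for p in range(3) for q in range(4))
print(err)
```

In a distributed setting each process would own a block of rows, so the two transform phases need no communication and all data movement is concentrated in the transposition.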

In the Poisson solver we have to solve tridiagonal systems of equations
distributed in z-direction. This is another point where we have to think about an
efficient parallel algorithm. Our approach for this is described in section 3.2. All
other parts of the algorithm can be calculated independently on all processors, if
we ensure an overlap of one row or column of gridpoints in each direction, which
has to be updated after each time step.

Fig. 2. Data decomposition: One-dimensional (a, b) and two-dimensional (c). The
arrows show the distribution of the tridiagonal systems (5)


3.2  Parallel  Poisson  solver 

For pressure correction we have to solve the Poisson equation

$$ \frac{\partial U_i}{\partial x_i} = \frac{\partial^2 p}{\partial x_j^2} $$

for the pressure p in each time step. This is done by two-dimensional Fourier
transforms of the left-hand sides in both horizontal directions, which results in
$(N_i/2 + 1) \cdot N_j$ ($N_i$, $N_j$, $N_k$ = number of gridpoints in x, y, and z
direction) tridiagonal systems of equations of rank $N_k$ with shear-periodic
boundaries ($\omega$ is a complex shear factor):

$$ p_{i-1} - 2 p_i + p_{i+1} = U_i, \quad i = 1, 2, \ldots, N_k, \quad p_i, U_i \in \mathbb{C} \qquad (5) $$

$$ p_0 = \omega\, p_{N_k}, \quad p_{N_k+1} = \ldots $$

Table 1. Technical data of the parallel systems

System                  Processor       Max. number  Memory per  Peak per proc.  SPECfp92     Cache
(manufacturer)          (manufacturer)  of proc.     proc. (MB)  (MFlop/s)       per proc.    (KB)
----------------------  --------------  -----------  ----------  --------------  -----------  ---------
Massively parallel systems with distributed memory:
GC/PowerPlus            601+            192 (96)     32 (64)     80              125          32
(Parsytec)              (PowerPC)
SP2 (IBM)               Power2 (IBM)    58 (+8)      256 (128)   267 (133)       244.6        256 (128)
                                                                                 (202.1)
T3D (Cray)              alpha 21064     512          64          150             200          8
                        (DEC)
Systems with shared memory:
Power Challenge XL      R8000 (MIPS)    16           500         300             311          4000
(SGI/SNI)
Ultra Enterprise 4000   -               8            128         unknown to      386          512
(SUN)                                                            the author
Parallel vector machines:
J916 (Cray)             -               16           256         200             -            -

The  components  pi ,  Ui  of  each  of  these  systems  are  distributed  in  z-direction  on 
the  grid.  The  systems  themselves  are  distributed  in  x-  and  y-direction  (fig.  2). 
After  solving  the  equations  the  results  are  transformed  back,  and  we  finally  get 
the  pressure  terms  p  for  the  next  time  step. 

For  parallelization  we  must  distinguish  between  the  phase  of  Fourier  transforms 
and  the  algorithm  for  solving  the  tridiagonal  systems.  The  Fourier  transforms 
can  be  done  simultaneously  on  the  distributed  datasets.  In  case  of  the  two- 
dimensional  decomposition  we  have  to  include  a  data  transposition  step. 

The distributed equations can be solved independently by the subsets of
processes with the same x and y coordinates. Each of the systems (5), which are
distributed in z-direction, is solved by an improved divide & conquer method
(cf. [7]) based on an algorithm from Mehrmann (cf. [5]).

4  Experiences  on  parallel  systems 

We had the opportunity to run the parallel turbulence simulation code on a set
of parallel computers, including shared memory and distributed memory
systems. The machines are summarized in table 1.

The parallel test runs are compared with a run on one processor of the CRAY
J916 with 256 MB of memory and a peak performance of 200 MFlop/s. This is
the machine the original code was designed and highly optimized for.

In order to evaluate performance data on all machines we ran simulations with
shear, stratification, and three passive scalars over 64 time steps on $128^3$
gridpoints. Here we discuss the timings for one timestep in this run.


Fig. 3. Runtimes for one timestep on $128^3$ gridpoints.

On the shared memory systems we get a speed-up of 7.44 on eight SGI processors
and one of 6.56 on eight SUN CPUs. On more than 8 processors of the Power
Challenge this value could not be increased. This restriction is due to the fixed
bandwidth of the underlying communication system, the SGI data bus.

The speed-up on the distributed systems presents a totally different picture. The
timing results are presented in figure 3. Only on the SP2 does the $128^3$ grid fit
on one processor, and here we get a speed-up of 62.95 for 64 processors.

The SP2 has the best timing results for one timestep: with 16 processors it
is 3.5 times faster than the Parsytec GC or the Cray T3D. The T3D with 128
CPUs is still 2 times slower than 64 SP2 processors, but 2 times faster than the
Parsytec GC.

On the IBM SP2 and the SUN Enterprise we need at least three processors to
get the same performance as one CRAY J90 vector processor. On the SGI
Power Challenge two CPUs are already faster than the J90.



VECPAR  '98  ■  3rd  International  Meeting  on  Vector  and  Parallel  Processing 


The results show that our parallel code has a good efficiency on medium sized
shared memory and larger distributed memory machines. But performance is not
the only important aspect. Running the program on 48 fat nodes of the SP2 allows
us to use 12 Gbyte of memory. Therefore we are able to compute grids with $480^3$
gridpoints, which is more than 8 times larger than on vector machines.
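The memory figures quoted above can be checked with a short back-of-envelope calculation. The following Python sketch is illustrative only: it assumes that "Gbyte" means 2**30 bytes and that the grid is cubic, and the helper names `bytes_per_gridpoint` and `max_cubic_grid` are hypothetical, not part of the simulation code.

```python
# Back-of-envelope memory budget for a cubic grid (illustrative sketch).
GIB = 2**30  # assumption: "Gbyte" is read as 2**30 bytes


def bytes_per_gridpoint(total_bytes: float, n: int) -> float:
    """Memory available per gridpoint on an n x n x n grid."""
    return total_bytes / n**3


def max_cubic_grid(total_bytes: float, bytes_per_point: float) -> int:
    """Largest n such that an n**3 grid fits into total_bytes."""
    n = int(round((total_bytes / bytes_per_point) ** (1.0 / 3.0)))
    # Correct for floating-point rounding of the cube root.
    while n**3 * bytes_per_point > total_bytes:
        n -= 1
    while (n + 1) ** 3 * bytes_per_point <= total_bytes:
        n += 1
    return n


total = 12 * GIB  # 48 nodes with 256 MB each
per_point = bytes_per_gridpoint(total, 480)
print(f"gridpoints on the 480^3 grid: {480**3}")
print(f"bytes available per gridpoint: {per_point:.1f}")
print(f"8-byte words per gridpoint:    {per_point / 8:.1f}")
```

Under these assumptions the budget corresponds to roughly a dozen double-precision values per gridpoint, which gives a feeling for how many three-dimensional fields such a run can hold.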


5  Resolving  the  Inertial  Subrange 

In the Kolmogorov spectrum of turbulence energy [4], the kinetic energy in the
flow is plotted over all wavenumbers, which are reciprocal to the size of the eddy
structures. This spectrum can be divided into three subranges: production, inertia,
and dissipation. In the inertial subrange the flow has universal properties, which
do not depend on the geometry or other physical parameters of the flow. The
existence of the inertial subrange depends on the Reynolds number of the flow.
In direct numerical simulation this is directly correlated to the grid resolution.
On the SP2 we were able to manage one run on $480^3$ gridpoints with a Reynolds
number of 600. By using a gradient Richardson number of 0.13 we force the flow
to become stationary, which means that the kinetic energy becomes constant. After
an initial phase this flow shows a self-similar energy spectrum with a decay as
$k^{-5/3}$ for $20 < k < 50$. This is the main indication for a resolved inertial
subrange. To our knowledge, this is the first time that Kolmogorov's inertial
subrange could be resolved in a computer simulation.


Fig. 4. Spectra of the kinetic energy of the simulation with $480^3$ gridpoints, Reynolds
number 600 (based on velocity fluctuation and integral scale)
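Kolmogorov's theory predicts an energy spectrum proportional to $k^{-5/3}$ in the inertial subrange, so one way to test for such a range is to fit the slope of the spectrum in a log-log plot. The Python sketch below illustrates the idea on synthetic data; the function name `loglog_slope`, the constant 1.5 and the generated spectrum are assumptions made for illustration and are not the simulation results.

```python
import math


def loglog_slope(ks, es):
    """Least-squares slope of log(E) plotted against log(k)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(e) for e in es]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic spectrum E(k) = C * k**(-5/3); C = 1.5 is an arbitrary constant.
ks = list(range(21, 50))  # wavenumbers with 20 < k < 50
es = [1.5 * k ** (-5.0 / 3.0) for k in ks]

slope = loglog_slope(ks, es)
print(f"fitted slope: {slope:.4f} (Kolmogorov value: {-5 / 3:.4f})")
```

With a measured spectrum, `ks` and `es` would hold the wavenumbers and energies of the range under test, and a fitted slope close to -5/3 would support the presence of an inertial subrange.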


6  Conclusion 

A fully parallelized version of an incompressible turbulence simulation code has
been presented. The parallel code achieves 2.446 GFlop/s on 64 SP2 processors,
which is 26.3 times faster than on one J90 vector processor. Thanks to the message
passing standard MPI the code is fully portable. The developed code is
now suited for use on state-of-the-art parallel computers.

We have demonstrated that the concept of message passing is not exclusive to
distributed memory machines, and we have shown that an MPI implementation
on a few processors, which takes advantage of the fast communication
possibilities of shared memory machines, can be better suited for parallel computing
than an implementation on some dedicated message passing machines.

In addition, we have developed a parallel code which already leads to new
physical results. For the first time, we resolved the inertial subrange of a
homogeneously turbulent and stratified shear flow by a direct numerical simulation.
This demonstrates how parallel computing can open the door to new
fundamental results in physics.

References 

1. Martin Bücker. Zweidimensionale Schnelle Fourier-Transformation auf massiv
parallelen Rechnern. Technical Report Jül-2833, ZAM, Forschungszentrum Jülich,
D-52425 Jülich, November 1993.

2. Claire Yung-Lei Chu. The fast Fourier transform on hypercube parallel computers.
PhD thesis, Cornell University, 1988.

3. Thomas Gerz, Ulrich Schumann, and S. E. Elghobashi. Direct numerical simulation
of stratified homogeneous turbulent shear flows. J. Fluid Mech., 200:563-594, 1989.

4. A. N. Kolmogorov. The local structure of turbulence in incompressible viscous
fluid for very large Reynolds number. C. R. Acad. Nauk SSSR, 30:301-303, 1941.
Reprinted in Proc. R. Soc. Lond. A, 434:9-13, 1991.

5. Volker Mehrmann. Divide and conquer methods for block tridiagonal systems.
Parallel Comput., 19:257-279, 1992.

6. Message Passing Interface Forum. MPI: A message-passing interface standard. June
1995.

7. Ulrich Schumann and Martin Strietzel. Parallel solution of tridiagonal systems for
the Poisson equation. J. Sci. Comput., 10(2):181-190, June 1995.

8. Paul N. Swarztrauber. Multiprocessor FFTs. Parallel Comput., 5:197-210, 1987.




A  Systolic  Algorithm  for  the  Factorisation  of  Matrices 
Arising  in  the  Field  of  Hydrodynamics 


S.-G. Seo^1, M. J. Downie^1, G. E. Hearn^1 and C. Phillips^2

^1 Department of Marine Technology, University of Newcastle upon Tyne,
Newcastle upon Tyne, NE1 7RU, UK

^2 Department of Computing Science, University of Newcastle upon Tyne,
Newcastle upon Tyne, NE1 7RU, UK


Abstract.  Systolic  algorithms  often  present  an  attractive  parallel 
programming  paradigm.  However,  the  unavailability  of  specialised  hardware 
for  efficient  implementation  means  that  such  algorithms  are  often  dismissed 
as  being  of  theoretical  interest  only.  In  this  paper  we  report  on  experience 
with  implementing  a  systolic  algorithm  for  matrix  factorisation  and  present  a 
modified  version  that  is  expected  to  lead  to  acceptable  performance  on  a 
distributed  memory  multicomputer.  The  origin  of  the  problem  that  generates 
the  full  complex  matrix  in  the  first  place  is  in  the  field  of  hydrodynamics. 


1.  Forming  the  Linear  System  of  Equations 

The  efficient,  safe  and  economic  design  of  large  floating  offshore  structures  and 
vessels  requires  a  knowledge  of  how  they  respond  in  an  often  hostile  wave 
environment  [1].  Prediction  of  the  hydrodynamic  forces  experienced  by  them  and 
their  resulting  responses,  which  occur  with  six  rigid  degrees  of  freedom,  involves 
the  use  of  complex  mathematical  models  leading  to  the  implementation  of 
computationally  demanding  software.  The  solution  to  such  problems  can  be 
formulated  in  terms  of  a  velocity  potential  involving  an  integral  expression  that  can 
be  thought  of  as  representing  a  distribution  of  sources  over  the  wetted  surface  of  the 
body.  In  most  applications  there  is  no  closed  solution  to  the  problem  and  it  has  to  be 
solved  numerically  using  a  discretisation  procedure  in  which  the  surface  of  the  body 
is  represented  by  a  number  of  panels,  or  facets.  The  accuracy  of  the  solution 
depends  on  a  number  of  factors,  one  of  which  is  the  resolution  of  the  discretisation. 
The  solution  converges  as  resolution  becomes  finer  and  complicated  geometries  can 
require  very  large  numbers  of  facets  to  attain  an  acceptable  solution. 

In the simplest approach a source is associated with each panel and the interaction
of the sources is modelled by a Green function which automatically satisfies relevant
'wave boundary conditions'. The strength of the sources is determined by satisfying a
velocity continuity condition over the mean wetted surface of the body. This is
achieved by setting up an influence matrix, A, for the sources based on the Green
functions and solving a linear set of algebraic equations in which the unknowns, x,
are either the source strengths or the velocity potential values and the right-hand
side, b, is related to the appropriate normal velocities at a representative point on
each facet. For a given wave frequency, each facet has a separate source contributing
to the wave potential representing the incoming and diffracted waves, $\phi_0$ and
$\phi_7$, and one radiation velocity potential for each degree of freedom of the motion,
$\phi_j$: $j = 1, 2, \ldots, 6$. Once the source strengths are known and the velocity
potentials have been determined, the pressure can be computed at every facet and the
resultant forces and moments on the body computed by integrating the pressure over the
wetted surface. The forces can then be introduced into the equations of motion and the
response of the vessel at the given wave frequency calculated.

The complexity of the mathematical form of the Green functions and the
requirement to refine the discretisation of the wetted surfaces within practical
offshore analyses significantly increase the memory and computational load
associated with the formulation of the required fluid-structure interactions. Similarly,
the solution of the very large dense square matrix equations formulated in terms of
complex variables requires considerable effort to provide the complex variable
solution. In some problems 5,000 panels might be required using constant plane
elements or 5,000 nodes using higher order boundary elements, leading to a matrix
with 25,000,000 locations. Since double precision arithmetic is required, and the
numbers are complex, each location occupies 16 bytes, so this will require memory
of the order of 0.4 gigabytes. The number of operations for direct inversion or
solution by iteration is large and of the order of $n^3$, e.g. 5,000 elements require
125,000 x $10^6$ operations. Furthermore, the
sea-state for a particular wave environment has a continuous frequency spectrum
which can be modelled as the sum of a number of regular waves of different
frequencies with random phase angles and amplitudes determined by the nature of
the spectrum. Determination of vessel responses in a realistic sea-state requires the
solution of the boundary integral problem described above over a range of discrete
frequencies sufficiently large to encompass the region in which the wave energy of
the irregular seas is concentrated. In view of the size and complexity of such
problems, and the importance of being able to treat them, it is essential to develop
methods to speed up their formulation and solution times. One possible means of
achieving this is through the use of parallel computers [2] [3].
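The storage and operation counts for a dense complex system follow directly from the number of unknowns. The Python sketch below works through the arithmetic for 5,000 unknowns; it assumes 8 bytes per double precision real number (so 16 bytes per complex entry), and the function names are hypothetical helpers introduced only for this illustration.

```python
def dense_complex_storage_bytes(n: int, bytes_per_real: int = 8) -> int:
    """Bytes needed to store a dense n x n matrix of complex numbers."""
    return n * n * 2 * bytes_per_real  # real and imaginary part per entry


def cubic_operation_count(n: int) -> int:
    """Order-of-magnitude operation count n**3 for a direct solution."""
    return n**3


n = 5000
entries = n * n
storage = dense_complex_storage_bytes(n)
operations = cubic_operation_count(n)

print(f"matrix entries:      {entries:,}")
print(f"storage (bytes):     {storage:,}")
print(f"storage (10^9 byte): {storage / 1e9:.1f}")
print(f"operations (~n^3):   {operations:,}")
```

The cubic growth of the operation count, compared with the quadratic growth of the storage, is the reason the factorisation dominates the cost as the number of panels increases.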

Since A is a full, square matrix, and in view of the uncertainty regarding the
convergence of iterative methods, the use of a direct method of solution based on
elimination techniques would seem the most attractive proposition. The method of
LU-decomposition has been chosen because in this scheme only one factorisation is
required for multiple right-hand side vectors b. It is well known that this
factorisation is computationally intensive, involving order $n^3$ arithmetic operations
(multiplications and additions). In contrast the forward- and backward-substitution
required to solve the original system once the factorisation has been performed
involves an order of magnitude less computation, namely order $n^2$, which becomes
insignificant as the matrix size increases. Consequently, we limit consideration to
the factorisation process only.

Formally, we have that the elements $u_{rj}$ of U are given by

$$u_{rj} = a_{rj} - \sum_{k=1}^{r-1} l_{rk} u_{kj}, \qquad j = r, \ldots, n \qquad (1)$$

and $l_{ir}$ of L are given by

$$l_{ir} = \left( a_{ir} - \sum_{k=1}^{r-1} l_{ik} u_{kr} \right) / u_{rr}, \qquad i = r+1, \ldots, n \qquad (2)$$

(Doolittle factorisation) leading to an upper-triangular U and a unit lower-triangular
L.
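The Doolittle formulas can be written down almost literally as a sequential reference routine. The following Python sketch is such a reference, not the systolic implementation discussed in this paper: it uses no pivoting, works on plain nested lists, and the names `doolittle_lu` and `matmul` as well as the small test matrix are assumptions chosen for illustration.

```python
def doolittle_lu(a):
    """Doolittle LU factorisation without pivoting.

    Returns (l, u) with l unit lower-triangular and u upper-triangular
    such that l * u reproduces a. Entries may be real or complex.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for r in range(n):
        l[r][r] = 1.0
        # Row r of U: u_rj = a_rj - sum_k l_rk * u_kj, for j = r..n-1
        for j in range(r, n):
            u[r][j] = a[r][j] - sum(l[r][k] * u[k][j] for k in range(r))
        if u[r][r] == 0:
            raise ZeroDivisionError("zero pivot; pivoting would be required")
        # Column r of L: l_ir = (a_ir - sum_k l_ik * u_kr) / u_rr
        for i in range(r + 1, n):
            l[i][r] = (a[i][r] - sum(l[i][k] * u[k][r] for k in range(r))) / u[r][r]
    return l, u


def matmul(x, y):
    """Product of two square matrices stored as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Small complex test matrix (arbitrary values chosen for illustration).
a = [[4 + 1j, 3, 2],
     [2, 1 - 1j, 3],
     [3, 4, 1 + 2j]]
l, u = doolittle_lu(a)
product = matmul(l, u)
residual = max(abs(product[i][j] - a[i][j]) for i in range(3) for j in range(3))
print(f"max |L*U - A| = {residual:.2e}")
```

A routine of this kind is mainly useful for validating the output of a parallel implementation on small matrices.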


2.  A  Naive  Systolic  Algorithm  Solution 

A  systolic  system  can  be  envisaged  as  an  array  of  synchronised  processing  elements 
(PEs),  or  cells,  which  process  data  in  parallel  by  passing  them  from  cell  to  cell  in  a 
regular  rhythmic  pattern.  Systolic  arrays  have  gained  popularity  because  of  their 
ability  to  exploit  massive  parallelism  and  pipelining  to  produce  high  performance 
computing  [4]  [5].  Although  systolic  algorithms  support  a  high  degree  of 
concurrency,  they  are  often  regarded  as  being  appropriate  only  for  those  machines 
specially  built  for  the  particular  algorithm  in  mind.  This  is  because  of  the  inherent 
high  communication/computation  ratio. 

In  a  soft-systolic  algorithm,  the  emphasis  is  on  retaining  systolic  computation  as  a 
design  principle  and  mapping  the  algorithm  onto  an  available  (non-systolic)  parallel 
architecture,  with  inevitable  trade-offs  in  speed  and  efficiency  due  to  communication 
and  processor  overheads  incurred  in  simulating  the  systolic  array  structure. 

Initially  a  systolic  matrix  factorisation  routine  was  written  in  Encore  Parallel 
Fortran  (epf)  and  tested  on  an  Encore  Multimax  520.  This  machine  has  seven  dual 
processor  cards  (a  maximum  of  ten  can  be  accommodated),  each  of  which  contains 
two  independent  10  MHz  National  Semiconductor  32532  32-bit  PEs  with  LSI 
memory  management  units  and  floating-point  co-processors.  This  work  was 
undertaken  on  a  well-established  machine  platform  that  was  able  to  provide  a 
mechanism  for  validating  the  model  of  computation,  and  the  software  developed 
from  that  model,  with  a  view  to  refining  the  model  for  subsequent  development  on  a 
state-of-the-art distributed memory parallel machine. The parallel programming
paradigm of epf is much simpler to work with than that for message passing on distributed
memory architectures and provides a convenient means of producing validation data
for  subsequent  developments.  As  long  as  the  underlying  algorithm  is  retained  this 
implementation  can  be  used  to  check  the  correctness  of  the  equations  developed  and 
the  validity  of  the  algorithm  with  a  view  to  an  eventual  port. 

The task of systolic array design may be defined more precisely by investigating
the cell types required, the way they are to be connected, and how data moves
through the array in order to achieve the desired computational effect. The
hexagonal shaped cells employed here are due to Kung and Leiserson [6].


Fig. 1. Data flow for LU factorisation


The manner in which the elements of L and U are computed in the array and
passed on to the cells that require them is demonstrated in Fig. 1. An example of a
4x4 cell array is shown for the purposes of illustrating the paradigm employed. Of
the PEs on the upper-right boundary, the top-most has three roles to perform:

1.  Produce  the  diagonal  components  of  L  (which,  for  a  Doolittle-based 
factorisation,  are  all  unit). 

2.  Produce  the  diagonal  components  of  U  using  elements  of  A  which  have 
filtered  through  the  cells  along  the  diagonal  (and  been  modified,  as  appropriate). 

3.  Pass  the  reciprocal  of  the  diagonal  components  of  U  down  and  left. 

The  cells  on  the  upper  left  boundary  are  responsible  for  computing  the  multipliers 
(the  elements  of  L),  having  received  the  appropriate  reciprocals  of  the  diagonal 
elements  of  U . 

The flow of data through the systolic array is shown in Fig. 1. Elements of A flow
in an upward direction; elements of L (computed from (2)) flow in a right-and-down
direction; and elements of U (computed from (1)) flow in a left-and-down direction.
Each cell computes a value every 3 clock ticks, although they start at different times.
Note that at each time step each cell responsible for forming an element of L or U
calculates one multiplication only in forming a partial summation. Data flowing out
correspond to the required elements of the L and U factors of A. Formally, the
elements of A are fed into the cells as indicated, although for efficiency the cells are
directly assigned the appropriate elements of A.

Table 1 shows the execution times ($T_p$) of the parallel code with a fixed-size
(100x100 double complex) matrix and various numbers of PEs ($p$). The gradual
algorithmic speed-up ($S_p$), defined as the ratio of the time to execute the parallel
program on a single processor to the time to execute the same program on $p$ processors,
is clearly seen all the way up to twelve PEs. The (generally) decreasing efficiency
($E_p$), defined as the ratio of speed-up to the number of PEs times 100, is a
consequence of the von Neumann bottleneck. The results show some minor
anomalies, but this is not atypical when attempting to obtain accurate timings on a
shared resource, with other processes - system or those of other users - interfering
with program behaviour, even at times of low activity. At this level, the results are
encouraging.


Table 1. Shared memory implementation

p       1     2     3     4     5     6     7     8     9    10    11    12
T_p   48.9  28.?  20.2  14.9  12.3  10.6   8.9   8.0   7.3   6.6   6.2   6.1
S_p    1     1.7   2.4   3.3   4.0   4.6   5.5   6.1   6.7   7.4   7.9   8.0
E_p   100    87    81    82    80    77    79    77    74    74    72    68
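Speed-up and efficiency follow from the execution times by simple ratios. The Python sketch below recomputes them for a few of the timings listed in Table 1; the function names are hypothetical, and because the tabulated times are rounded the recomputed values can differ slightly from the printed ones.

```python
def speedup(t1: float, tp: float) -> float:
    """Speed-up S_p = T_1 / T_p."""
    return t1 / tp


def efficiency_percent(t1: float, tp: float, p: int) -> float:
    """Efficiency E_p = 100 * S_p / p."""
    return 100.0 * speedup(t1, tp) / p


# Selected execution times from Table 1 (number of PEs -> time).
timings = {1: 48.9, 3: 20.2, 4: 14.9, 8: 8.0, 12: 6.1}
t1 = timings[1]

for p, tp in sorted(timings.items()):
    s = speedup(t1, tp)
    e = efficiency_percent(t1, tp, p)
    print(f"p = {p:2d}  T_p = {tp:5.1f}  S_p = {s:4.1f}  E_p = {e:5.1f}")
```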

The algorithm was compared with an existing parallel matrix factorisation routine
[7], which uses more conventional techniques, available in the Department of
Marine Technology (DMT). The results are summarised in Fig. 2, which shows the
speedup $S_p$ for the systolic algorithm (from Table 1) alongside the speedup
for the DMT algorithm. The extra cost of the systolic algorithm due to overheads
associated with index calculation and array accesses causes significant delays in the
computation, resulting in much poorer performance of the systolic algorithm in
comparison to the DMT routine in terms of elapsed time. Nevertheless, the systolic
algorithm shows better speedup characteristics than the DMT algorithm, as
illustrated by Fig. 2.


Fig. 2. Comparison of speedup for systolic and DMT algorithms.

If A is an n by n dense matrix then the systolic algorithm implemented on $n^2$ PEs
can compute L and U in 4n clock ticks, giving a cell efficiency of 33%. Assuming a
hardware system in which the number of PEs is much less than the number of cells,
and using an appropriate mapping of cells to PEs, we can improve this position
considerably, and we now address this issue.


3.  An  Improved  Systolic  Algorithm 

As already indicated, the ultimate goal is to produce an efficient systolic matrix
factorisation routine for general-purpose distributed parallel systems including
clusters of workstations. This is to be achieved by increasing the granularity of the
computation within the algorithm and thus reducing the communication/computation
ratio, while balancing the load on each PE to minimise the adverse
effect due to enforced synchronisation. Nevertheless, the characteristics peculiar to
systolic algorithms should be retained. Thus we aim at

•  massive,  distributed  parallelism 

•  local  communication  only 

•  a  synchronous  mode  of  operation 

Each  PE  will  need  to  perform  far  more  complex  operations  than  in  the  original 
systolic  systems  used  as  the  original  basis  for  implementation.  There  is  an  inevitable 
increase  in  complexity  from  the  organisation  of  the  data  handling  required  at  each 
synchronisation  (message-passing)  point. 


Fig. 3. Allocation of pseudo cells to PEs for a 6x6 matrix.

The systolic algorithm can be regarded as a wave front passing through the cells
in an upwards direction in Fig. 1. This means that all pseudo cells in a horizontal
line, corresponding to a reverse diagonal of A, become active at once. We therefore
allocate the whole of a reverse diagonal to a PE, and distribute the reverse
diagonals from the top left to the bottom right in a cyclic manner so as to maintain
an even load balance (for a suitably large matrix size). Fig. 3 shows such a
distribution for a 6x6 matrix distributed on 4 PEs.

The computation starts at PE0 with pseudo cell 1. As time increases, the
computation domain over the matrix domain grows, and later shrinks. The shape
of the computation domain is initially triangular, to include the first few reverse
diagonals. On passing the main reverse diagonal, the computation domain becomes
pentagonal, and remains so until the bottom right-hand corner is reached, when it
becomes a quadrilateral. Once the computation domain has covered the whole of the
domain of pseudo cells it shrinks back to the top left, whilst retaining its
quadrilateral shape. The whole process is completed in 3n-2 timesteps.
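The cyclic allocation of reverse diagonals to PEs described above can be sketched in a few lines. The Python fragment below is an illustration under stated assumptions: rows and columns are numbered from 0, the reverse diagonal of element (i, j) is identified by i + j, and the function names are hypothetical rather than taken from the Fortran implementation.

```python
def reverse_diagonal_owner(i: int, j: int, num_pes: int) -> int:
    """PE owning element (i, j): reverse diagonal i + j, dealt out cyclically."""
    return (i + j) % num_pes


def allocation_map(n: int, num_pes: int):
    """Owner PE of every element of an n x n matrix."""
    return [[reverse_diagonal_owner(i, j, num_pes) for j in range(n)]
            for i in range(n)]


def load_per_pe(n: int, num_pes: int):
    """Number of matrix elements (pseudo cells) assigned to each PE."""
    counts = [0] * num_pes
    for row in allocation_map(n, num_pes):
        for owner in row:
            counts[owner] += 1
    return counts


# 6x6 matrix on 4 PEs, as in the example discussed in the text.
for row in allocation_map(6, 4):
    print(" ".join(str(owner) for owner in row))
print("elements per PE:", load_per_pe(6, 4))
```

For small matrices the load is visibly uneven because the reverse diagonals have different lengths; the imbalance becomes relatively smaller as the matrix grows, in line with the remark about suitably large matrix sizes.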

A Fortran implementation of the revised systolic algorithm is currently under
development using the distributed memory Cray T3D at the Edinburgh Parallel
Computing Centre (EPCC) and the Fujitsu AP1000 at Imperial College, London,
with MPI [8] [9] for message passing.


Fig.  4.  Speedup  on  distributed  memory  machine  for  400  by  400  array 


In  this  implementation  the  elements  in  each  reverse  (top-right  to  bottom-left) 
diagonal  of  the  matrix  are  bundled  together  so  that  they  are  dealt  with  by  a  single  PE 
and  all  of  the  reverse  diagonals  which  are  active  at  a  given  time  step  are  again 
grouped  together  to  be  passed  around  as  a  single  message.  Preliminary  experience 
with the revised algorithm indicates that it scales well, as shown by the speedup
figures for a 400 by 400 array computed on the Fujitsu machine and illustrated in
Fig. 4.


4. Conclusions

The algorithm described initially was a precursor to a more generalised one
designed with the intention of increasing the granularity of computation and with
the solution of large systems of equations in mind. As expected, it performed poorly
in terms of elapsed time due to the penalties imposed by an unfavourable balance
between processor communication and computation. However, the speedup
characteristics compared favourably with those of a more conventional approach
and pointed to the potential for the generalised algorithm.

Accordingly, a generalised version of the algorithm has been developed and is
currently being tested. The speedup obtained on the distributed memory machine
suggests that the approach taken is valid. It remains to benchmark the algorithm
against a conventional solver, such as the one available within the public domain
software package ScaLAPACK.


Abstract. The current study discusses the results of parallelization of
a computer program based on the SIMPLE algorithm and applied to
the prediction of the laminar backward-facing step flow. The domain of
integration has been split into subdomains, as if the flow were made up
of physically distinct domains of integration.

The  convergence  characteristics  of  the  parallel  algorithm  have  been  studied 
as  a  function  of  grid  size,  number  of  subdomains  and  flow  Reynolds  num¬ 
ber.  The  results  showed  that  the  difficulties  of  convergence  increase  with 
the  complexity  of  the  flow,  as  the  Reynolds  number  increases  and  extra 
recirculation  regions  appear. 


1  Introduction 

Parallel computing has become more and more common, and has developed
from a subject of specialized scientific meetings and journals into an affordable
technology, to a point where we feel that no references are needed to support
this statement.

Fluid flow algorithms are complex, given the number of equations involved,
the elliptic nature, non-linearity and peculiarities of the pressure-velocity
coupling for incompressible flows. All these features are familiar to those in the fluid
dynamics community and make this a formidable set of equations, intractable
by theoretical approaches. Parallelization makes things even worse, and our
option was to study parallel fluid flow algorithms through applications to a series of
flow geometries and conditions.

In the present study we discuss the numerical behaviour of a parallel version
of the SIMPLE algorithm [14]. We show how the convergence was influenced by
the Reynolds number, grid size, number of subdomains and overlapping between
the subdomains. A previous work [3] has been followed and new findings and
conclusions were added as a result of a new set of calculations using different
flow geometry and conditions, i.e. the laminar backward-facing step flow.


2 Mathematical model and strategy of parallelization


The  fluid  flow  equations,  assuming  steady-state,  incompressible  and  Newtonian 
fluid,  were  solved  on  a  Cartesian  coordinate  system. 


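$$\frac{\partial u_i}{\partial x_i} = 0 \qquad (1)$$

$$\rho \frac{\partial u_i u_j}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \frac{\partial}{\partial x_j}\left( \mu \frac{\partial u_i}{\partial x_j} \right) \qquad (2)$$

where $u_i$ is the velocity along the direction $x_i$, $p$ is the pressure, and $\rho$ and $\mu$ are
the density and fluid dynamic viscosity, respectively.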

These equations were discretized in a numerical grid, with all variables being
defined at the same location (the collocated grid [15]) and following the finite
volume approach [6]. The hybrid and central finite differencing schemes were
used for discretization of the convective and diffusive terms, respectively.


Fig. 1. Overlapping and data exchange between subdomains


The SIMPLE algorithm [14] was used. In this approach, after
rewriting the continuity equation (1) as a function of a pressure correction
variable, mass and momentum conservation are alternately enforced. Equations (1)
and (2) are solved as if they were independent (or segregated) systems of
equations, until a prescribed criterion of convergence is satisfied.

For parallelization of the algorithm, the integration domain was split along
the horizontal direction into a variable number of subdomains, with overlapping
(Figure 1). Within each subdomain, the SIMPLE algorithm was used, followed
by exchange of data (pressure gradient, and both u and v velocities) between
subdomains. This was one iteration, equivalent to one iteration of the sequential
version of the SIMPLE algorithm comprising the full domain.

The simplest case, with 2 subdomains only, will be used as an example (Figure
1). The first column of subdomain 2 (the west boundary condition) was taken
as one of the columns interior to subdomain 1. The last column of subdomain
1 (the east boundary condition) was taken as one of the columns interior to
subdomain 2. The level of overlapping between the subdomains (nf a) depended
on where inside the subdomains the interior columns were located. The amount
of transferred data did not depend on the overlapping: it was a function of
the number of grid nodes along the vertical only.
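The splitting of the domain into overlapping subdomains can be illustrated with a small index calculation. The Python sketch below is a simplified model and not the code used in this study: it assumes columns numbered from 0, subdomains of nearly equal width, and a hypothetical parameter `halo` giving the number of columns by which each subdomain extends into its neighbour.

```python
def split_with_overlap(n_cols: int, n_sub: int, halo: int):
    """Inclusive (first, last) column ranges of n_sub overlapping subdomains.

    Each subdomain owns a contiguous block of columns and is extended by
    `halo` columns into each neighbour, so that its boundary column lies
    in the interior of the neighbouring subdomain.
    """
    base, extra = divmod(n_cols, n_sub)
    ranges = []
    start = 0
    for s in range(n_sub):
        width = base + (1 if s < extra else 0)
        first = max(0, start - halo)
        last = min(n_cols - 1, start + width - 1 + halo)
        ranges.append((first, last))
        start += width
    return ranges


def values_exchanged_per_interface(n_rows: int, n_fields: int = 3) -> int:
    """Values sent across one interface per iteration (both directions).

    One boundary column per direction and per field (pressure gradient,
    u and v); the count does not depend on the overlap.
    """
    return 2 * n_rows * n_fields


subdomains = split_with_overlap(n_cols=100, n_sub=2, halo=2)
print("column ranges:", subdomains)
print("values exchanged per interface:", values_exchanged_per_interface(64))
```

In this model the first column of the second subdomain lies inside the first subdomain, and the amount of exchanged data depends only on the number of rows, mirroring the two properties stated in the text.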

Other strategies of parallelization have been suggested involving
parallelization of the equation solver only, either in the case of a segregated approach
(e.g. [11]) or in cases where all the fluid equations are solved simultaneously as part of
a single system (e.g. [5]). Either of these two alternatives, when compared with
the present parallelization strategy, improves robustness at a cost of increased
communication times.

The test case was the laminar backward-facing step flow (Figure 2), with
an expansion ratio of 1:2 and a domain extending over 30 step heights, h.
A parabolic velocity profile was specified at the inlet section. At the walls,
velocities  were  set  to  zero  and  the  pressure  was  found  by  zero  gradient  along  the 
perpendicular  direction.  Zero  axial  gradient  was  the  boundary  condition  at  the 
outlet  section  for  all  variables. 

The tri-diagonal matrix algorithm (TDMA) was used to solve the system of
algebraic  linearized  equations,  with  under-relaxation  factors  of  0.7  for  velocities 
and  0.3  for  pressure.  The  number  of  TDMA  iterations  was  fixed  and  set  at  2 
for  u  and  v  velocities,  and  4  for  pressure.  There  was  no  previous  optimization 
of  these  parameters,  which  were  kept  unaltered  during  the  course  of  the  current 
study. 
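
As a reminder of what the solver step involves, the following is a minimal Python sketch of the TDMA (Thomas algorithm) for a single tridiagonal system, together with one common explicit form of under-relaxation. The function names are hypothetical, and the paper does not state how the sweeps over grid lines were organized or how the under-relaxation was applied, so this should be read as an illustration rather than as the authors' code.

```python
def tdma(a, b, c, d):
    """Solve a tridiagonal system with the Thomas algorithm.

    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side.
    """
    n = len(d)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def under_relax(old, new, alpha):
    """Blend the previous and the newly computed field with factor alpha."""
    return [o + alpha * (v - o) for o, v in zip(old, new)]
```

With `alpha` set to 0.7 for velocities and 0.3 for pressure, `under_relax` keeps 30% and 70% of the previous field, respectively.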

The calculations were stopped when the residual was lower than $10^{-4}$.


3  Results 

Results were obtained for numerical grids ranging from 100x64 up to 250x160,
and Reynolds numbers between $10^2$ and $10^3$ in steps of $10^2$. The Reynolds
number was defined by $Re = \rho U 2h / \mu$, where U is the average axial
velocity at the inlet section. The flow regime remains laminar for Reynolds
numbers lower than 1200 (cf. [2]).

The calculations were performed on a shared memory computer architecture
with a 1.2 GB/s system bus bandwidth (SGI Power Challenge, with 4 processors
R8000/75 MHz and a total of 512 MB of RAM; operating system IRIX 6.1 and
Fortran compiler MIPSPro Power Fortran77 6.1). The communication between
processors was performed via version 3.3 of the PVM message passing protocol [7].


3.1  The  flow  pattern 

For a prior assessment of the computer program, the streamline pattern (Figure
2) was analyzed. For all Reynolds numbers tested, there was a main
recirculation region downstream of the step, whose length increased with the
Reynolds number (Figure 3). The results follow the trend observed in
experiments (cf. [2], [16]) and previous calculations (e.g. [10], [18], [4]). The
deviation from the experimental curve at Re=500 has been attributed to
three-dimensional effects, not accounted for by our calculations.

Fig. 2. Streamline pattern as a function of the Reynolds number

For a Reynolds number around 400, a second recirculation region appears,
attached to the top wall of the channel (Figure 2), reaching a maximum length
of about 10h at Re=1000 (the highest Reynolds number used). This flow pattern
has been observed experimentally (cf. [2], [16]) and is shown here to confirm
the quality of our calculations.

Around 50% of the computing time was spent within the routine for solution
of the linear system of equations. The assembling of the coefficients of the
three differential equations of pressure, and u and v velocities required 48%.
The communication time was estimated to be 2% of the total computing time.
This was a consequence of the computer architecture, but also a consequence of
the algorithm, with the communication overhead much reduced compared with
parallel fluid algorithms based on the parallelization of the equation solver.

Fig. 3. Size of the main recirculation region as a function of the Reynolds number

3.2 The convergence of the parallel algorithm

We were interested in studying the effect of parallelization on the convergence
of the algorithm, and Figure 4 shows the number of iterations as a function of
Reynolds number for four grid sizes. In case of the parallel version the results
have been plotted in terms of global iterations and the equivalent grid size of
the sequential version. There are two regions in this Figure.

In region I the number of iterations decreased to a minimum, obtained at
Re=400 for grids 200x128 and 250x160, and Re=500 for grids 100x64 and
150x96. This is related to the recirculation region attached to the top wall
and can be confirmed by joint observation of Figures 4 and 2.

For Reynolds numbers higher than 400 (or 500 for grids 100x64 and 150x96),
in region II, the number of iterations increased with the Reynolds number. The
convergence becomes more difficult as the recirculation at the top wall grows
(see also Figure 2). The convergence is tightly coupled with the flow pattern.

There was a minimum value of the Reynolds number (region I), depending on
the grid size, below which the residual did not fall under $10^{-4}$ and the
convergence criterion was not satisfied. For instance, for Re=100, after 9000,
6000 and 3550 iterations for grids 250x160, 200x128 and 150x96, the residual
became constant at $3.28 \times 10^{-4}$, $1.97 \times 10^{-4}$ and
$1.12 \times 10^{-4}$, respectively. For coarser grids, the constant residual
was lower and could be reached at a lower number of iterations. However, the
flow pattern was as expected and we concluded that the criterion of convergence
was too restrictive. Nevertheless, for consistency with the remaining cases,
the criterion of convergence was not changed and we considered these as
non-converged solutions. To some extent, this is related to the numerical
technique used to solve the linear system of equations. For instance,
calculations performed using the SIP (Strongly Implicit Procedure [17]) solver
were able to satisfy the convergence criterion for grid 150x96, Re=100.

Fig. 4. Number of iterations as a function of the Reynolds number and grid size (filled
symbols: parallel version (2 subdomains); open symbols: sequential version)

The results of the parallel version with 2 subdomains (filled symbols) are
also included in Figure 4. In most of the cases, the number of iterations (global
iterations) of the parallel version was identical to the number of iterations of
the sequential version. However, there were cases where the parallel version converged
faster (e.g. grid 150x96 and Re=600), whereas in the most unfavourable
situation (grid 100x64 and Re=700), the parallel version required 38% more
iterations than the sequential version.

The convergence characteristics of grids 100x64 and 200x128 are shown in
Figure 5 as a function of the Reynolds number. In case of grid 100x64 (Figure 5a),
for Re lower than 500 the sequential version requires more iterations to converge
than the parallel version, whereas for Re higher than 500 the situation is
reversed. In general, for Re higher than 500 the number of iterations increases
with the number of subdomains. In case of Re=1000 the calculation with 3
subdomains requires 3100 iterations compared with 2800 for the sequential version.

The behaviour for grid 200x128 (Figure 5b), apart from showing the same
increase in the number of iterations with the Reynolds number as grid 100x64,
reveals other details worth mentioning. For instance, convergence could not
be achieved for Reynolds numbers lower than 300, even in case of the sequential
version. There was no optimization of the under-relaxation factors; this
exercise was beyond the scope of the present work. It is our experience [12]
that the SIMPLE algorithm requires lower under-relaxation factors for finer
grids. The plateau shown for 500<Re<800 by grid 200x128 may be associated with
the occurrence of the recirculation region at the bottom wall around x/h=11, a
feature to which the coarse grid could not be sensitive because of lack of
resolution. This is a statement that we found difficult to confirm.

Fig. 5. Number of iterations as a function of Reynolds number and number of
subdomains for two numerical grids: a) 100x64; b) 200x128.

Figure 5 shows that the number of subdomains did not alter the general
pattern of the convergence characteristics compared with the sequential version.
This is valid throughout the current study and led us to the conclusion that the
domain splitting, even when the interface between the subdomains divides the
recirculation region, does not affect the convergence.


Fig. 6. Number of iterations as a function of subdomain overlapping and number of
subdomains for grid 100x64 and two Reynolds numbers: a) Re=100; b) Re=1000.


3.3  Domain  overlapping 

The effect of overlapping between subdomains was studied [1] for Re=100, 500
and 1000 on the grid 100x64. Results are shown here (Figure 6) for Re=100
and 1000 as a function of the overlapping ratio f, defined as the ratio between
the number of nodes in the overlapping regions and the total number of grid
nodes. The number of nodes was identical for each subdomain, to avoid load
balancing problems. This, however, restricted the overlapping to only a few
values, depending on the grid nodes and on the number of subdomains.

The results at Re=100 (Figure 6a) are markedly different from those at
Re=1000 (Figure 6b). At Re=100 the parallel version requires fewer iterations
than the sequential version and is not sensitive to the number of subdomains.
The number of iterations increases slightly with the overlapping, with the
exception of f=0.4.

At Re=1000, for each case (2, 3 or 4 subdomains), there is a value of f
beyond which there is no reduction of the number of iterations. Furthermore, for
all  cases,  the  number  of  iterations  of  the  parallel  exceeds  the  number  of  iterations 
of  the  sequential  version. 


3.4  Speed-up 


The speed-up ($S_p$) was defined by

$$S_p = \frac{T_s}{T_p} \qquad (3)$$

where $T_s$ and $T_p$ stand for the execution time of the sequential and parallel
versions of the code. The computing time was the actual wall-clock elapsed time,
as given by the UNIX command time and following the recommendations of ref.
[9].

One  must  remember  that  the  parallel  version  performs  extra  calculations, 
because  the  grid  nodes  within  the  overlapping  region  are  calculated  twice.  Taking 
that into account, the expected speed-up values are 1.8, 2.5 and 3.2, for 2, 3 and
4  subdomains,  respectively.  These  values  are  shown  by  horizontal  lines  in  Figure 
7  and  were  obtained  assuming  identical  number  of  iterations  for  both  sequential 
and  parallel  calculations,  time  per  iteration  directly  proportional  to  the  number 
of  nodes  and  negligible  communication  overhead. 
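
Under the three assumptions just listed, the expected speed-up reduces to the ratio between the number of grid columns solved sequentially and the number solved by one subdomain, overlap included. The short Python sketch below makes this explicit; the function name and the column counts in the usage note are hypothetical and are not taken from the paper.

```python
def expected_speedup(total_cols, cols_per_subdomain):
    """Ideal speed-up when time per iteration is proportional to node count.

    Assumes equal iteration counts in both versions and no communication
    cost, so the time ratio equals the ratio of columns being solved.
    """
    if cols_per_subdomain <= 0:
        raise ValueError("cols_per_subdomain must be positive")
    return total_cols / cols_per_subdomain
```

For instance, a hypothetical split of 100 columns into 2 subdomains of 55 columns each (50 own columns plus 5 overlapping ones) gives `expected_speedup(100, 55)`, which is slightly above 1.8 rather than the ideal value of 2.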

Figure 7a shows that for grid 100x64 and Reynolds numbers lower than 500
the speed-up exceeded the expected value. The domain splitting was favourable
to the convergence of the algorithm, and the number of iterations was reduced
compared with the sequential version. For Reynolds numbers higher than 500,
the speed-up was below the ideal value, in particular for cases with 3 and 4
subdomains. Nevertheless, in real time, with 4 subdomains a converged
solution can be obtained at least 2.6 times faster than with a sequential
calculation.

Figure 7b shows a fairly uniform trend and all values were close to the
maximum speed-up.

Fig. 7. Speed-up as a function of the Reynolds number, in case of 2, 3 and 4 subdomains
for two numerical grids: a) 100x64; b) 200x128.


4 Conclusions


Results were shown of the simulation of the laminar backward-facing step flow. A
parallelized version of the SIMPLE algorithm was used, based on the partitioning
of the domain. The main conclusions were:

1. The minimum number of iterations was obtained for Reynolds number about
400 or 500, depending on the grid size. The convergence pattern was tightly
coupled with the flow pattern, and the number of iterations increased as
soon as a second recirculation region attached to the top wall appeared.

2. The comparison between the sequential and the parallel versions of the
algorithm showed that the number of iterations was in many cases identical in
both versions. The communication overhead was 2% of the total computing
time.

3. The number of subdomains did not alter the general pattern of the
convergence characteristics compared with the sequential version.

4. The convergence was not affected by domain splitting, even when the
interface between the subdomains divided the recirculation region.


Abstract. Caching has been intensively used in memory and traditional
file systems to improve system performance. However, the use of caching
in parallel file systems has been limited to I/O nodes to avoid cache
coherence problems. In this paper we present the cache mechanisms
implemented in ParFiSys, a parallel file system developed at the UPM.
ParFiSys exploits the use of cache, both at processing and I/O nodes,
with aggressive prefetching and delayed-write techniques. The cache
coherence problem is solved by using a dynamic scheme of cache coherence
protocols with different sizes and shapes of granularity. Performance
results, obtained on an IBM SP2, are presented to demonstrate the
advantages offered by the cache management methods used in ParFiSys.
Keywords: Parallel file systems, data declustering, cache coherence
protocols, false sharing, multi-files.


1  Introduction 

There is a general trend to use parallelism in the I/O systems to alleviate the
growing disparity in computational and I/O capability of the parallel and
distributed architectures. Parallelism in the I/O subsystem is obtained using
several independent I/O nodes supporting one or more secondary storage devices.
Data are declustered among these nodes and devices to allow parallel access to
different files, and parallel access to the same file. Parallelism has been used
in some parallel file systems and I/O libraries described in the bibliography
(CFS [20], Vesta [5], ParFiSys [2], PASSION [4], Galley [19], Scotch [11],
PIOUS [17]).

Caching  has  been  a  technique  frequently  used  in  memory  and  traditional  file 
systems  to  improve  system  performance.  Caching  [14,2]  can  be  used  in  parallel 
file  systems,  by  allocating  a  buffer  cache  at  the  processing  nodes  (PN)  and  I/O 
nodes (ION). This approach improves I/O performance by avoiding unnecessary
disk traffic, network traffic and server load, and also by allowing prefetching
and  delayed-write  techniques  [14,2].  However,  the  use  of  caching  in  parallel  file 
systems  has  been  limited  to  I/O  nodes  because  any  attempt  to  maintain  caching 
at  the  processing  nodes  would  require  a  cache  coherence  protocol. 

In this paper we demonstrate that the use of caching at processing nodes is
feasible in parallel file systems. With this aim we show the cache management
policies and mechanisms implemented in ParFiSys¹, a parallel file system
developed at the UPM². ParFiSys exploits the use of caching both at processing
and I/O nodes. It avoids the cache coherence and false sharing problems in an
efficient manner by using a dynamic scheme of cache coherence protocols with
different sizes and shapes of granularity.

The rest of the paper is organized as follows. Section 2 describes the main
architectural features of ParFiSys. Section 3 presents the cache management
implemented in ParFiSys. The cache coherence problem and how ParFiSys solves
it are explained in section 4. Performance measurements are shown in
section 5. Finally, section 6 summarizes our conclusions.


2  ParFiSys  Architecture 

ParFiSys [2,3] is a parallel and distributed file system developed at the UPM to
provide parallel I/O services for parallel and distributed systems. To fully exploit
all the parallel and distributed features of the I/O hardware, the architecture
of ParFiSys is clearly divided into two levels: file services and block services
(see figure 1). The first level is encapsulated in a component named ParClient.
It provides file services that can be obtained using two mechanisms: a linked
library or a message passing library. The first is preferred for parallel
machines [4,13], where a single user is usually expected per processing node
(PN). The second is aimed at distributed systems, where several users may be
requesting I/O services from the ParClient. Both modalities provide the users
with a ParClient library that includes the POSIX interface and some
high-performance extensions [2]. The main architectural difference between the
two models is that with the linked library approach the ParClient has to be
present on every PN requesting I/O, while the message passing option allows the
existence of remote users for a ParClient. This option is specifically intended
for distributed file systems or large-scale parallel machines, and it can be
used to define groups of users related to a single ParClient to increase
scalability [2]. The ParClient translates user addresses to logical blocks,
establishing the connections with the ParServer. All communication is handled
through a high performance I/O library named ParServer library. This library
optimizes the I/O requests and sends them to the I/O servers via message
passing. It also controls the flow of data from the I/O servers to the
application's address space and vice versa. The ParServers, located at the
input/output nodes (IONs), deal with logical block requests issued by the
ParClient, translating them to the local secondary storage devices. Both
levels intensively use caching to optimize I/O operations: the ParClient to avoid
remote operations, and the ParServer to reduce accesses to devices.

¹ http://laurel.datsi.fi.upm.es/~gp/parfisys.html

² ParFiSys has been developed under EU's ESPRIT Project GPMIMD (P-5404)

Fig. 1. ParFiSys Architecture

ParFiSys uses a very generic distributed partition which allows several
types of file systems to be created on any kind of parallel I/O system. A
distributed partition has a unique identifier, physical layout, list of
sub-partitions, etc. The physical layout, represented as the tuple
$(\{NODE_n\}, \{CTLR_c\}_n, \{DEV_d\}_{nc})$, describes the set of I/O nodes,
controllers per node, and devices per controller. ParFiSys partitions can be
modified freely by the administrator, adding or removing devices dynamically.
The only restriction to be considered is that devices being used by the
existing applications should not be affected. The current implementation of
ParFiSys supports three kinds of predefined file systems on the partition
structure (see figure 2):

- UNIX-like non-distributed file systems, where |NODE| = 1, |CTLR| = 1,
|DEV| = 1.

- Extended distributed file systems with sequential layout, where |NODE| =
k, |CTLR| = n, |DEV| = m. They can be seen as a concatenation of UNIX-like
partitions.

- Distributed file systems striped with cyclic layout, where |NODE| = k,
|CTLR| = n, |DEV| = m. Blocks are distributed through the partition
devices following a round-robin pattern.
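
The difference between the sequential and the cyclic layouts listed above can be illustrated with a small Python sketch. The function names, the flat numbering of devices and the order in which the round-robin visits them are assumptions made for the example; the text does not specify how ParFiSys orders devices within a partition.

```python
def device_triple(dev_index, n_ctlr, n_dev):
    """Convert a flat device index into (node, controller, device)."""
    node, rest = divmod(dev_index, n_ctlr * n_dev)
    ctlr, dev = divmod(rest, n_dev)
    return node, ctlr, dev


def locate_sequential(block, blocks_per_device):
    """Sequential layout: fill one device completely before using the next.

    Returns (flat device index, block offset inside that device).
    """
    dev_index, offset = divmod(block, blocks_per_device)
    return dev_index, offset


def locate_cyclic(block, total_devices):
    """Cyclic layout: consecutive blocks go to consecutive devices.

    Returns (flat device index, block offset inside that device).
    """
    offset, dev_index = divmod(block, total_devices)
    return dev_index, offset
```

With k = 2 nodes, n = 1 controller per node and m = 2 devices per controller (4 devices in total), `locate_cyclic(5, 4)` places logical block 5 on flat device 1, which `device_triple(1, 1, 2)` maps to node 0, controller 0, device 1. Consecutive blocks therefore land on different devices and can be accessed in parallel, whereas the sequential layout keeps them on the same device until it is full.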


3  Cache  Management  in  ParFiSys 

ParFiSys exploits the use of caching both in ParClient and ParServer (see figure
1), allowing multiple readers and writers concurrently. Especially important is
the use of cache at the ParClient. Each ParClient has a buffer cache that is
maintained by a ParClient cache manager. When a user request is received in a
ParClient, the whole buffer is analyzed to obtain the list of file blocks
associated with the buffer.

Fig. 2. File Systems Available on ParFiSys


Once the user buffer has been mapped to file blocks, the whole block list is
searched in the ParClient cache with a single search operation. After this step,
two new lists of blocks are obtained: present and absent blocks. The present
blocks, those found in the cache, are immediately copied to the user space. The
absent blocks, those not found in the cache, have to be requested from the I/O
devices through the ParServer library. This library is conceived as a high-level
device driver that concurrently manages the ION related operations. Requests
not serviced are enqueued and consumed later by a thread that explores the list
of blocks from the queue and generates an independent list of blocks for each
ION. These blocks are requested concurrently from each ParServer, overlapping
I/O and computation (see figure 3). The result of this stage will be different
depending on the mapping function associated with each logical device. In any
case, once the set of blocks stored in an I/O node has been determined, a thread
is awakened to take care of the ParClient-ParServer operations related to this
set of blocks. All the threads execute concurrently, notifying the end of their
operations by synchronizing themselves with a barrier previously established by
the ParServer library. An I/O operation is finished at this level only when all
the threads have reached the barrier. At this moment, the result is notified to
the cache management procedures and the involved I/O operations are finished.
The number of client-server interactions needed depends on the maximum number of
blocks, named grouping factor, that can be requested from each file system in a
single operation.
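
The read path described above (split the block list into present and absent blocks, build one list per ION, fetch the lists concurrently and finish at a barrier) can be sketched as follows. This is a simplified illustration, not the ParFiSys implementation: all names are hypothetical, the block-to-ION mapping is assumed to be `block % n_ions`, the cache is a plain dictionary, and the grouping factor is ignored.

```python
import threading
from collections import defaultdict


def split_present_absent(blocks, cache):
    """Classify the requested blocks as present in or absent from the cache."""
    present = [b for b in blocks if b in cache]
    absent = [b for b in blocks if b not in cache]
    return present, absent


def group_by_ion(blocks, n_ions):
    """Hypothetical mapping function: block b is stored on ION b % n_ions."""
    groups = defaultdict(list)
    for b in blocks:
        groups[b % n_ions].append(b)
    return dict(groups)


def fetch_absent(absent, n_ions, read_from_ion, cache):
    """Fetch each ION's block list in its own thread and wait at a barrier."""
    groups = group_by_ion(absent, n_ions)
    barrier = threading.Barrier(len(groups) + 1)  # workers plus the caller
    lock = threading.Lock()

    def worker(ion, blocks):
        try:
            data = read_from_ion(ion, blocks)  # expected to return {block: data}
            with lock:
                cache.update(data)
        finally:
            barrier.wait()

    for ion, blocks in groups.items():
        threading.Thread(target=worker, args=(ion, blocks)).start()
    barrier.wait()  # the operation ends only when every worker has arrived
```

Here `read_from_ion` stands in for the ParClient-ParServer exchange; any callable that takes an ION number and a block list and returns a dictionary of block contents can be plugged in for experimentation.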

This cache scheme exploits the parallelism in the accesses, having two main
advantages: there is no synchronization among the ParServers involved in an
I/O operation, and there is no sequentialization in the ParClient-ParServer
communication.

Fig. 3. Parallel Operations in ParFiSys

Two  additional  mechanisms  have  been  used  to  enhance  the  behavior  of  the 
cache  in  ParFiSys: 


Read-ahead Each ParServer reads ahead data using an Infinite Block Lookahead
(IBL) predictor, that computes the number of blocks (n) to be read in
advance. Prefetch is executed in the ParClient using an adaptive predictor, valid
for sequential and interleaved patterns, whose behavior depends on the I/O
patterns exhibited by the local processes. Prefetch is executed asynchronously
to the user requests to improve the response time.

Write-before-full This is a delayed-write policy that flushes dirty blocks from
the ParClient to the ParServer, and from the ParServer to the I/O devices,
before free blocks may be needed in the cache. Preflush is activated when a low
threshold, calculated by the write-before-full daemon, is reached. When a write
request is executed, the number of dirty blocks in the cache is computed. If it
is larger than a fixed threshold, a massive flush is executed for the dirty blocks
belonging to the file system storing the file. All the operations are executed
asynchronously to the user requests to avoid delaying the response time.
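
A minimal Python sketch of the threshold test performed on each write is given below. It is a toy model under stated assumptions: the class and attribute names are hypothetical, a single fixed threshold is used, the flush is synchronous (ParFiSys performs it asynchronously), and the daemon-computed low threshold is not modelled.

```python
class WriteBackCache:
    """Toy delayed-write cache that flushes when too many blocks are dirty."""

    def __init__(self, flush, dirty_threshold):
        self.flush = flush                    # callable receiving {block: data}
        self.dirty_threshold = dirty_threshold
        self.blocks = {}                      # cached block contents
        self.dirty = set()                    # blocks not yet written downstream

    def write(self, block, data):
        """Store the block locally and flush only if the threshold is exceeded."""
        self.blocks[block] = data
        self.dirty.add(block)
        if len(self.dirty) > self.dirty_threshold:
            self.preflush()

    def preflush(self):
        """Send every dirty block downstream in one batch and mark it clean."""
        self.flush({b: self.blocks[b] for b in sorted(self.dirty)})
        self.dirty.clear()
```

Writes below the threshold cost no downstream traffic at all, and the blocks that accumulate are sent in a single batch, which is the benefit the delayed-write policy is after.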

4  Avoiding  the  Cache  Coherence  Problem 

The main problem of using caching at the client nodes is the possibility of having
shared writing of a file from different clients [1,18,10], which might lead to an
incoherent view of data. Nelson [18] describes two forms of write-sharing:
sequential write-sharing (SWS), that occurs when a client reads or writes a file
that was previously written by another client, and concurrent write-sharing
(CWS), that occurs when a file is simultaneously open for reading and writing on
more than one client. Concurrent write-sharing is not usual in distributed file
systems [1,18], but it is very frequent in parallel file systems [16] and
meta-computing [22].

A cache coherence protocol is required to avoid this problem. The use of
cache coherence protocols has been unpopular in parallel file systems because of
their overhead [16]. Thus, most parallel file systems usually have caching
schemes such as the ones implemented in CFS [20], where only the IONs maintain a
buffer cache for files. This solution avoids cache coherence problems because
there is only a single copy of the data in the whole system. Distributed file
systems, where write sharing is infrequent [18], use cache coherence protocols
mostly based on weak [12] or coarse grain models [18]. However, most existing
file systems with cache coherence protocols fail to provide efficient solutions
to the problem of cache coherence for parallel applications that concurrently
write a file. NFS [23], very popular in commercial environments, is unable to
maintain a consistent view of the file system for parallel applications. AFS
[12] does not support concurrent write-sharing due to the session semantics
implemented, which makes it unsuitable for parallel applications. Sprite [18]
ensures concurrent write-sharing coherence by disabling client caches, thus
limiting the potential benefits of caching for many parallel applications.
There are very few cache coherence solutions in parallel file systems. ENWRICH
[21] provides a cache coherence solution for parallel file systems, but it is
not a general one because client caches are only used for writing.

Recently, other approaches to avoid the cache coherence problem in parallel and
distributed file systems have been proposed. An example is the cooperative cache
scheme  proposed  in  [7],  that  eliminates  the  cache  coherence  problem  by  avoiding 
data  replication.  This  solution,  however,  reduces  the  potential  parallelism  that 
the  use  of  data  replication  may  offer. 


4.1  ParFiSys  coherence  model 

In order to solve the write-sharing problem, ParFiSys uses a dynamic cache
coherence scheme [9,10] based on the following protocols:

- Sequential coherence protocol (SCP), that solves the SWS problem and
detects the CWS on a file.

-  Concurrent  coherence  protocol  (CCP),  that  solves  the  CWS  problem  after 
being  activated  when  the  SCP  detects  a  CWS  situation  on  a  file. 

Sequential coherence protocol This protocol ensures coherence in SWS
situations and detects CWS situations on a file. It has a behavior similar to the
Sprite protocol [18]: the servers track open and close operations to know not only
which clients are currently using a file, but whether any of them are potential
writers. When a client opens a file, the event is notified to the server storing the
file descriptor. If there is no CWS for the file, the server checks whether another
client has updated the file data in its local cache, because of the delayed-write
policy, and requests it to flush the data. When the server has the most
up-to-date copy of the file, a message is sent to the client to enable local
caching for the file. No more interactions with the server are needed to maintain
coherence, thus alleviating overhead. When a CWS situation is detected by the
server, a message is sent to all the clients with the file open to activate the
CCP. When the CWS situation disappears, a message is sent to all the clients with
the file open to deactivate the CCP.

SCP has been optimized to reduce client-server interactions by sending
coherence messages to the servers only when a change of the client local state of
the file occurs. The local state of a file changes when it is opened for the
first time, when it was open for read and it is opened for write, when it is
closed for write and it remains open for read, and when it is closed by the last
user in the client. SCP has also been optimized to reduce server load by
distributing the protocol overhead among all servers. Each server executes SCP
only for the files whose descriptors are stored on it, which alleviates the
bottleneck of a centralized service and improves scalability.
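
The server-side bookkeeping that lets SCP detect a CWS situation can be sketched as follows. This is a hypothetical, heavily reduced model: the class and method names are invented for the example, client identifiers are plain strings, the flush request issued for SWS is omitted, and the return value only indicates whether CWS holds after the event (a real server would notify the clients when this value changes).

```python
class ScpServer:
    """Toy per-file open/close tracker in the spirit of the SCP description."""

    def __init__(self):
        self.readers = {}  # file name -> clients holding it open for read
        self.writers = {}  # file name -> clients holding it open for write

    def _cws(self, name):
        """CWS: the file is open on more than one client and one is a writer."""
        readers = self.readers.get(name, set())
        writers = self.writers.get(name, set())
        return len(readers | writers) > 1 and bool(writers)

    def open(self, name, client, write=False):
        """Record an open and report whether the CCP must be active."""
        table = self.writers if write else self.readers
        table.setdefault(name, set()).add(client)
        return self._cws(name)

    def close(self, name, client):
        """Record a close and report whether the CCP must remain active."""
        self.readers.get(name, set()).discard(client)
        self.writers.get(name, set()).discard(client)
        return self._cws(name)
```

A single writer, or any number of readers, never triggers the concurrent protocol; it is switched on only while a writer coexists with another client, which keeps the common case free of per-access coherence traffic.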


Concurrent  coherence  protocol  This  protocol  solves  the  CWS  problem.  It  is 
activated  when  a  CWS  is  detected  on  a  file,  being  executed  on  each  access  to  the 
file  while  the  CWS  situation  remains.  CCP  is  based  on  invalidations,  directories 
and the existence of an exclusive write-shared copy of data.

The  main  problem  in  cache  coherence  protocols  is  the  false  sharing  situation 
generated  when  multiple  processes,  belonging  to  the  same  parallel  application, 
access  the  same  file  for  writing.  To  alleviate  false  sharing  problems,  it  would 
be  desirable  to  allow  the  parallel  applications  to  adjust  the  granularity  of  the 
protocol  to  their  I/O  patterns.  ParFiSys  ensures  this  by  allowing  the  users  to 
define  coherence  regions.  A  coherence  region  is  a  disjoint  subset  of  the  file  used 
as the coherence unit. It has two main features: size and shape. The size of a
region may range from the whole file to a single byte. The shape of a region can be
defined  according  to  the  most  frequent  parallel  access  patterns:  sequential  and 
interleaved  (see  figure  4). 

The  applications  can  define  the  mapping  of  the  regions  on  a  file  using  four 
parameters  (see  figure  4): 

-  Register  size,  minimum  unit  for  coherence. 

-  Register  stride,  width  of  the  register  groups  into  each  segment. 

-  Segment  size,  number  of  register  groups  in  a  segment. 

-  Segment  stride,  distance  between  two  segments  with  the  same  pattern. 

This region model can map very different parallel I/O patterns. Moreover, it
makes it possible to define optimal regions, those best suited to the I/O access
pattern, which minimize coherence overhead. Optimal regions are defined on a file
when:
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Fig.  4.  Some  Coherence  Region  Patterns 
1. The number of regions is equal to the number of processes in the parallel
application.

2.  Each  process  only  accesses  the  data  of  a  single  region. 
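The two conditions can be checked mechanically for a given access trace. The sketch below is illustrative only: `is_optimal`, the trace format and the `region_of` mapping are assumptions made for the example, and the number of regions is taken to be the number of distinct regions the trace touches.

```python
def is_optimal(accesses, region_of, num_processes):
    """Check the two optimal-region conditions for an access trace.

    accesses:  iterable of (process_id, byte_offset) pairs (hypothetical format)
    region_of: function mapping a byte offset to a region identifier
    """
    regions_per_process = {}
    all_regions = set()
    for pid, offset in accesses:
        region = region_of(offset)
        all_regions.add(region)
        regions_per_process.setdefault(pid, set()).add(region)
    # Condition 2: every process touches exactly one region.
    one_region_each = all(len(r) == 1 for r in regions_per_process.values())
    # Condition 1: as many regions as processes.
    return len(all_regions) == num_processes and one_region_each


# Four processes writing 8 KB records in an interleaved fashion.
RECORD = 8192
trace = [(p, (i * 4 + p) * RECORD) for i in range(3) for p in range(4)]

interleaved = lambda off: (off // RECORD) % 4   # one interleaved region per process
whole_file = lambda off: 0                      # file granularity
per_block = lambda off: off // RECORD           # block granularity
```

For this trace only the interleaved mapping satisfies both conditions. File granularity yields a single shared region, and block granularity yields more regions than processes.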

The use of optimal regions makes it possible to match the protocol granularity
exactly to the I/O pattern of the applications, offering coherence at minimal
cost. This coherence regions model can be applied to the High Performance Fortran
distributions [8], the Vesta interface [5] and the MPI-IO interface [6].

CCP compels the clients to check the coherence state of the region on every
access to it by acquiring the appropriate read or write tokens. When a client
does not have the appropriate token, it must send a message to the region's
manager to request the desired rights on the region. The server stores a callback
for each region in a coherence directory to trace the coherence state of the
region. If a conflict is detected, the callbacks are revoked. The region's
manager guarantees that at any given time there is either a single read-write
token or any number of read-only tokens. When a write token is revoked in a
client, the client must flush any dirty data of the region. If the new token is
for read, the token in the previous writer is changed from writing to read-only.
Once the appropriate rights on a region are acquired, they remain until explicit
revocation, thus eliminating the overhead for future accesses.

Two main design mechanisms were used in ParFiSys to define the coherence
protocols: callback and directory location and management, and callback
revocation policy. Two policies can be used for the first issue: centralized (C),
where coherence for all the regions of a file is maintained by the server storing
the file descriptor (region manager), and distributed (D), where the information
of the coherence regions is distributed among several servers, each one being
responsible for maintaining the coherence of the regions allocated to it. Two
approaches can be used to revoke callbacks when a conflict appears: server driven
(SD) and client driven (CD). In SD, the server
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Fig.  5.  Point-to-point  communication  throughput 


sends revocation messages to all the clients caching data from the conflicting
region. In CD, the client generating the conflict sends revocation messages, on
behalf of the servers, to the other clients caching data from the conflicting
region. Several CCPs have been implemented in ParFiSys by combining different
management and revocation policies: C-SD, C-CD, D-SD, and D-CD.
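The token rule (one read-write token or any number of read-only tokens per region) can be sketched as a tiny region manager. This is a simplified, server-driven illustration and not the ParFiSys implementation: the class, method and action names are hypothetical, and message passing is replaced by a returned list of revocations.

```python
class RegionManager:
    """Tracks which clients hold read or write tokens for one coherence region."""

    def __init__(self):
        self.readers = set()
        self.writer = None

    def acquire(self, client, write=False):
        """Grant a token and return the (client, action) revocations it caused."""
        revoked = []
        if write:
            # A write token is exclusive: every other token must be revoked.
            for other in sorted(self.readers - {client}):
                revoked.append((other, "invalidate"))
            if self.writer not in (None, client):
                revoked.append((self.writer, "flush+invalidate"))
            self.readers = set()
            self.writer = client
        else:
            if self.writer not in (None, client):
                # The previous writer flushes dirty data and keeps a read-only token.
                revoked.append((self.writer, "flush+downgrade"))
                self.readers.add(self.writer)
                self.writer = None
            if self.writer != client:
                self.readers.add(client)
        return revoked
```

A client that already holds the needed token causes no revocations, which mirrors the statement that acquired rights remain valid until they are explicitly revoked.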

5  Performance  Evaluation 

This section describes the performance experiments designed to test cache
management in ParFiSys, and the results obtained by running them on an IBM SP2
machine.

The IBM SP2 used is a distributed-memory MIMD machine with 14 nodes
available to execute parallel applications. Each node has a 66 MHz POWER2
RISC System/6000 processor with 256 MB of memory and is connected to both
an Ethernet and IBM's high performance switch. Because IBM's message-passing
libraries (PVM, MPL or MPI) cannot operate in a multi-threaded environment,
we have implemented a multi-threaded subset of MPI using TCP/IP on top of the
high performance switch. To characterize the message-passing performance of our
communication library, we executed two simple benchmarks on the SP2. The first
evaluates point-to-point communication throughput by engaging two processes in a
sort of ping-pong. One process reads the value of a wall-time clock before
invoking a send operation and then blocks in a receive operation. Once the latter
operation finishes, the clock is read again. The throughput achieved is computed
from half of the communication time and the message size. Figure 5 shows the
results obtained for this benchmark. The maximum throughput between two nodes is
approximately 60 MB/s.
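The throughput computation of the ping-pong benchmark can be written out as follows. This is a minimal sketch: the `send` and `recv` callables are placeholders for whatever communication library is used, and the injectable `clock` parameter is an assumption added so the function can be exercised without a network.

```python
import time


def pingpong_throughput(send, recv, message_bytes, clock=time.perf_counter):
    """Return one-way throughput in bytes per second for one ping-pong exchange."""
    start = clock()
    send()   # ship the message to the peer
    recv()   # block until the peer echoes it back
    round_trip = clock() - start
    # Half of the round-trip time is attributed to one direction.
    return message_bytes / (round_trip / 2.0)
```

For example, a 1,000,000-byte message with a measured round trip of 0.5 s corresponds to 4,000,000 bytes/s in each direction.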

The second benchmark emulates ParFiSys reading and writing activity. We
used 4 servers and varied the number of clients from 1 to 8. Clients send (write
to servers) or receive (read from servers) 100 MB declustered across the servers.
Each client sends, or receives, the same amount of data to, or from, the servers
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Fig.  6.  Total  message  passing  throughput  in  ParFiSys 


Fig. 7. Prefetching and write-before-full performance


using a fixed record size. Figure 6 shows the total throughput obtained for
reading and writing operations. As shown in the figure, the maximum throughput
achieved increases with the number of clients and the record size.

All experiments described below were executed using 4 IONs, each with a single
simulated disk with a bandwidth of 5 MB/s. In all tests a 4 MB per-client cache
and a 16 MB per-ION cache were used. The file size used in all experiments was
100 MB. The stripe width was 4 and the stripe-unit size was 8 KB.


5.1  Prefetch and write-before-full evaluation

Figure  7,  left,  shows  the  throughput  obtained  when  a  client  sequentially  reads 
a  file  of  100  MB.  This  test  varied  the  read-ahead  value,  from  0  KB  to  256  KB, 
and  the  access  size,  from  1  KB  to  64  KB. 

Figure 7, right, shows the throughput obtained when a client sequentially
writes a file of 100 MB using either write-through or write-before-full. The
threshold used in write-before-full was 95% of the ParClient cache size. The
results obtained demonstrate that having a cache at the processing nodes, managed
using prefetch and write-before-full, is a useful mechanism to increase read and
write performance in parallel file systems.


5.2  Cache  Coherence  Performance  in  ParFiSys 

Two benchmarks were defined to evaluate the ParFiSys coherence protocol and
to demonstrate its feasibility for parallel applications: a segmented concurrent
write benchmark (SCWB) and an interleaved concurrent write benchmark (ICWB).
The SCWB, similar to the one described in [17], is a parallel program with
partitioned access that divides a file into contiguous segments, one per process,
with each segment accessed sequentially by a different process. In the ICWB,
each process concurrently writes the file in an interleaved fashion. The parallel
program for each benchmark consists of 1, 2, 4 and 8 processes that concurrently
write a file of 100 MB. Two access sizes are used: 8 KB and 32 KB.
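The two access patterns can be made concrete by listing the file offsets each process writes. The sketch below is an illustration, not the benchmark code: the function names are hypothetical, and it assumes the file size is a multiple of the number of processes times the access size.

```python
def segmented_offsets(rank, nprocs, file_size, access_size):
    """SCWB-style pattern: each process writes its own contiguous segment."""
    segment = file_size // nprocs
    start = rank * segment
    return range(start, start + segment, access_size)


def interleaved_offsets(rank, nprocs, file_size, access_size):
    """ICWB-style pattern: processes take turns, one record each, round-robin."""
    return range(rank * access_size, file_size, nprocs * access_size)


# Small example: a 64 KB file, 8 KB records, 4 processes.
KB = 1024
seg = [list(segmented_offsets(r, 4, 64 * KB, 8 * KB)) for r in range(4)]
ilv = [list(interleaved_offsets(r, 4, 64 * KB, 8 * KB)) for r in range(4)]
```

In both patterns every record of the file is written by exactly one process. Only the assignment of records to processes differs, which is what a coherence region has to match.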


Fig. 8. Concurrent segmented and interleaved benchmark for 8 KB access size
(horizontal axis of both panels: number of clients)


Figures 8 and 9 show the aggregated bandwidth for concurrent segmented
and interleaved I/O patterns, respectively. Several cache coherence protocols
are evaluated using SCWB and ICWB: client cache deactivated (NO CACHE),
file granularity (WHOLE F.), block granularity for centralized protocol (BL_C),
and optimal regions for centralized protocol (OPT_C). The first relevant result
obtained shows that maintaining coherence with file granularity is the worst
method, mainly due to false sharing. Deactivating the cache performs very
similarly to the block centralized protocol, because the number of client-server
interactions is almost the same. The small difference observed between
deactivating the cache and the block distributed protocol is mainly due to the
lower contention generated on the servers by the coherence protocols. This
feature also makes the block distributed protocol more scalable. The best results
are obtained using optimal regions, both centralized and distributed. This
behavior is mainly due to the elimination of false sharing,
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Fig.  9.  Concurrent  segmented  and  interleaved  benchmark  for  32  KB  access  size 


the main problem with whole file granularity, the minimization of the coherence
load, the main problem in block granularity protocols, and the local cache
utilization, the main problem in cache deactivation.

Concurrent write benchmarks for segmented and interleaved I/O patterns
show that optimal regions provide performance very close to the ideal.
Moreover, they scale well compared with the other protocols, whose
performance decreases very quickly as the number of clients is increased.

Figure 10 compares the bandwidth obtained in the SCWB using normal files
with optimal regions for centralized and distributed protocols, versus the use of
multi-files. A multi-file [15, 2] is a collection of subfiles, each of which is a
separate sequence of bytes. A multi-file is created by a parallel program with a
certain number of subfiles, usually equal to the number of processes in the
program, with each process accessing its own subfile. A multi-file combines the
advantages of a single file, a single name for a single data set, with those of
multiple files, which are independently addressable. Multi-files are an
alternative mechanism to the segmented patterns used in SCWB. The results are
very similar in all cases, being slightly better for multi-files because a write
on a subfile is lighter than a write on a normal file with concurrent access. The
proposed protocols with normal files offer an efficient alternative to the use of
multi-files for segmented I/O patterns, because they deliver very similar
performance with the advantage of using less specialized and more portable
interfaces.

6  Conclusions 

This paper has presented the design of the cache scheme implemented in ParFiSys.
ParFiSys allocates buffer caches at the processing and I/O nodes, improving I/O
performance by avoiding unnecessary disk traffic, network traffic and server
load, and by allowing prefetching and delayed-write techniques. The cache
coherence problem is solved, without loss of scalability, by using a dynamic
scheme of cache coherence protocols where data coherence is maintained on
user-defined
Fig. 10. Optimal regions in SCWB versus multi-files for 8 KB and 32 KB access sizes


coherence regions for the conflicting file. The use of two protocols, SCP
and CCP, makes it possible to handle all the conflicting situations for SWS and
CWS patterns, as demonstrated by the evaluation results obtained by running
ParFiSys on an IBM SP2. The benchmarks used to test the model show considerably
better results for our model than for other existing models. The aggregated
bandwidth obtained is higher when using our model, mainly because false sharing
is reduced, coherence load is minimized, and local caches at processing nodes
are heavily used. The proposed protocols also offer an efficient alternative to
the use of multi-files for segmented I/O patterns, because they deliver very
similar performance with the advantage of using less specialized and more
portable interfaces.
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Abstract.  We  study  the  Jacobi-Davidson  method  for  the  solution  of 
large  generalized  eigenproblems  as  they  arise  in  MagnetoHydroDynamics. 
We  have  combined  Jacobi-Davidson  (using  standard  Ritz  values)  with 
a shift and invert technique. We apply a complete LU decomposition in
which reordering strategies based on a combination of block cyclic reduction
and domain decomposition result in a well-parallelizable algorithm.
Moreover, we describe a variant of Jacobi-Davidson in which harmonic
Ritz values are used. In this variant the same parallel LU decomposition
is used, but this time as a preconditioner to solve the 'correction'
equation.

The size of the relatively small projected eigenproblems which have to
be solved in the Jacobi-Davidson method is controlled by several parameters.
The influence of these parameters on both the parallel performance and
convergence behaviour will be studied. Numerical results of
Jacobi-Davidson obtained with standard and harmonic Ritz values will
be shown. Executions have been performed on a Cray T3E.


1  Introduction 

Consider  the  generalized  eigenvalue  problem 

$$Ax = \lambda Bx, \qquad (1)$$

in which $A$ and $B$ are complex block tridiagonal $N_t$-by-$N_t$ matrices and
$B$ is Hermitian positive definite. The number of diagonal blocks is denoted by
$N$ and the blocks are $n$-by-$n$, so $N_t = N \times n$. In close cooperation
with the FOM Institute for Plasma Physics "Rijnhuizen" in Nieuwegein, where one
is interested in such generalized eigenvalue problems, we have developed a
parallel code to solve (1). In particular, the physicists like to have accurate
approximations of certain interior eigenvalues, called the Alfvén spectrum. A
promising method for computing these eigenvalues is the Jacobi-Davidson (JD)
method [3, 4]. With this method it is possible to find several interior
eigenvalues in the neighbourhood of a given target $\sigma$ and their associated
eigenvectors.

In general, the subblocks of $A$ are dense, those of $B$ are rather sparse
($\approx$ 20% nonzero elements) and $N_t$ can be very large (realistic values
are $N = 500$ and $n = 800$), so computer storage demands are very high.
Therefore, we study the feasibility of parallel computers with a large
distributed memory for solving (1).
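A rough estimate shows why the storage demands are high. The sketch below assumes that every block of the block tridiagonal matrix $A$ is stored densely and that each entry is a double-precision complex number of 16 bytes. Both are assumptions made for illustration; the text does not state the storage format.

```python
def block_tridiagonal_bytes(num_blocks, block_size, bytes_per_entry=16):
    """Bytes needed to store all blocks of a block tridiagonal matrix densely."""
    # N diagonal blocks plus (N - 1) blocks on each of the two off-diagonals.
    blocks = 3 * num_blocks - 2
    return blocks * block_size * block_size * bytes_per_entry


N, n = 500, 800                      # the "realistic values" quoted in the text
order = N * n                        # N_t, the order of the matrices
storage_A = block_tridiagonal_bytes(N, n)
```

Under these assumptions the matrix order is $N_t = 400{,}000$ and $A$ alone takes about 15.3 GB, before counting $B$, the LU factors or the search vectors.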

In [2], Jacobi-Davidson has been combined with a parallel method to compute
the action of the inverse of the block tridiagonal matrix $A - \sigma B$. In this
approach, called DDCR, a block-reordering based on a combination of Domain
Decomposition and Cyclic Reduction is combined with a complete block LU
decomposition of $A - \sigma B$. Due to the special construction of $L$ and $U$,
the solution process parallelizes well.

In this paper we describe two Jacobi-Davidson variants, one using standard
Ritz values and one using harmonic Ritz values. The first variant uses DDCR to
transform the generalized eigenvalue problem into a standard eigenvalue problem.
In the second one DDCR has been applied as a preconditioner to solve the
'correction' equation approximately. This approach also results in a projected
standard eigenvalue problem with eigenvalues in the dominant part of the
spectrum. In Section 2 both approaches are described. To prevent the projected
system from becoming too large, we make use of a restarting technique. Numerical
results, based on this technique, are analyzed in Section 3. We end with some
conclusions and remarks in Section 4.


2  Parallel  Jacobi-Davidson 

2.1  Standard  Ritz  values 

The availability of a complete LU decomposition of the matrix $A - \sigma B$
gives us the opportunity to apply Jacobi-Davidson to a standard eigenvalue
problem instead of a generalized eigenvalue problem. To that end, we rewrite (1)
as

$$(A - \sigma B)x = (\lambda - \sigma)Bx. \qquad (2)$$

If we define $Q := (A - \sigma B)^{-1} B$ then (2) can be written as

$$Qx = \mu x, \quad \text{with} \quad \mu = \frac{1}{\lambda - \sigma} \iff \lambda = \sigma + \frac{1}{\mu}. \qquad (3)$$

The eigenvalues we are interested in form the dominant part of the spectrum
of $Q$, which makes them relatively easy to find. The action of the operator $Q$
consists of a matrix-vector multiplication with $B$, a perfectly scalable
parallel operation, combined with two triangular solves with $L$ and $U$.
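The effect of the shift-and-invert transformation on the spectrum can be checked with a few scalars. In the sketch below the target $\sigma$ is the one used for Problem 1, while the three sample eigenvalues are hypothetical values chosen only for illustration.

```python
def to_mu(lam, sigma):
    """Map an eigenvalue lambda of (A, B) to the eigenvalue mu of Q."""
    return 1.0 / (lam - sigma)


def to_lambda(mu, sigma):
    """Inverse map: recover lambda from an eigenvalue mu of Q."""
    return sigma + 1.0 / mu


sigma = complex(-0.08, 0.60)
# Hypothetical eigenvalues, one of them close to the target.
lambdas = [complex(-0.081, 0.61), complex(-0.5, 0.2), complex(3.0, -1.0)]

closest_to_target = min(lambdas, key=lambda z: abs(z - sigma))
dominant_in_q = max(lambdas, key=lambda z: abs(to_mu(z, sigma)))
```

The eigenvalue closest to $\sigma$ is the one whose image $\mu$ has the largest modulus, which is why the wanted eigenvalues form the dominant part of the spectrum of $Q$.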

At the $k$-th step of Jacobi-Davidson, an eigenvector $x$ is approximated by a
linear combination of $k$ search vectors $v_j$, $j = 1, 2, \ldots, k$, where $k$
is very small compared with $N_t$. Consider the $N_t$-by-$k$ matrix $V_k$, whose
columns are given by $v_j$. The approximation to the eigenvector can be written
as $V_k s$, for some $k$-vector $s$. The search directions $v_j$ are made
orthonormal to each other, using Modified Gram-Schmidt (MGS), hence
$V_k^* V_k = I$.

Let $\theta$ denote an approximation of an eigenvalue associated with the Ritz
vector $u = V_k s$. The vector $s$ and the scalar $\theta$ are constructed in
such a way that the residual vector $r = Q V_k s - \theta V_k s$ is orthogonal to
the $k$ search directions. From this Rayleigh-Ritz requirement it follows that

$$V_k^* Q V_k s = \theta V_k^* V_k s \iff V_k^* Q V_k s = \theta s. \qquad (4)$$

The size of the matrix $V_k^* Q V_k$ is $k$. By using a proper restart technique
$k$ stays so small that this 'projected' eigenvalue problem can be solved by a
sequential method.

In order to obtain a new search direction, Jacobi-Davidson requires the solution
of a system of linear equations, called the 'correction equation'. Numerical
experiments show that fast convergence to selected eigenvalues can be obtained
by solving the correction equation to some modest accuracy only, by some steps
of an inner iterative method, e.g. GMRES.

Below we show the Jacobi-Davidson steps used for computing several eigenpairs
of (3) using standard Ritz values.

step 0: initialize
    Choose an initial vector $v_1$ with $\|v_1\|_2 = 1$; set $V_1 = [v_1]$;
    $W_1 = [Q v_1]$; $k = 1$; $it = 1$; $n_{ev} = 0$

step 1: update the projected system
    Compute the last column and row of $H_k := V_k^* W_k$

step 2: solve and choose approximate eigensolution of projected system
    Compute the eigenvalues $\theta_1, \ldots, \theta_k$ of $H_k$ and choose
    $\theta := \theta_j$ with $|\theta_j|$ maximal and $\theta_j \neq \mu_i$,
    for $i = 1, \ldots, n_{ev}$; compute associated eigenvector $s$ with
    $\|s\|_2 = 1$

step 3: compute Ritz vector and check accuracy
    Let $u$ be the Ritz vector $V_k s$; compute the residual vector
    $r := W_k s - \theta u$;
    if $\|r\|_2 < tol_{sJD} \cdot |\theta|$ then
        $n_{ev} := n_{ev} + 1$; $\mu_{n_{ev}} := \theta$;
        if $n_{ev} = N_{ev}$ stop; goto 2
    else if $it = iter$ stop
    end if

step 4: solve correction equation approximately with $it_{SOL}$ steps of GMRES
    Determine an approximate solution $\tilde{z}$ of $z$ in
    $(I - uu^*)(Q - \theta I)(I - uu^*)z = -r \;\wedge\; u^* z = 0$

step 5: restart if projected system has reached its maximum order
    if $k = m$ then
        5a: Set $k = k_{min} + n_{ev}$. Construct $C \in \mathbb{C}^{m \times k}$
            from $H_m$; orthonormalize columns of $C$;
            compute $H_k := C^* H_m C$
        5b: Compute $V_k := V_m C$; $W_k := W_m C$
    end if

step 6: add new search direction
    $k := k + 1$; $it := it + 1$; call MGS $[V_{k-1}, \tilde{z}]$;
    set $V_k = [V_{k-1}, \tilde{z}]$; $W_k = [W_{k-1}, Q\tilde{z}]$;
    goto 1

Steps 2 and 5a deal with the small projected system (4). Those sequential
steps are performed by all processors in order to avoid communication. The basic
ingredients of the other steps are matrix-vector products, vector updates and
inner products. Since, for our applications, $N_t$ is much larger than the number
of processors, those steps parallelize well.
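The orthonormalization in step 6 relies only on inner products and vector updates. A minimal sequential sketch of Modified Gram-Schmidt for complex vectors is given below. It is not the parallel implementation of the paper, and the function names are hypothetical. In a distributed setting each inner product would additionally need a global reduction.

```python
import math


def inner(x, y):
    """Hermitian inner product x^* y."""
    return sum(a.conjugate() * b for a, b in zip(x, y))


def mgs_append(basis, z):
    """Orthonormalize z against the orthonormal vectors in basis and append it."""
    z = [complex(c) for c in z]
    for v in basis:
        # Modified Gram-Schmidt: project out one direction at a time,
        # always using the already updated vector z.
        coeff = inner(v, z)
        z = [zc - coeff * vc for zc, vc in zip(z, v)]
    norm = math.sqrt(inner(z, z).real)
    if norm == 0.0:
        raise ValueError("new direction is linearly dependent on the basis")
    basis.append([c / norm for c in z])
    return basis


V = []
for vec in ([1, 1j, 0], [1, 0, 1], [0, 1, 1]):
    mgs_append(V, vec)
```

After the loop the stored vectors satisfy $V_k^* V_k = I$ up to rounding errors.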

2.2  Harmonic  Ritz  values 

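For the introduction of harmonic Ritz values we return to the original generalized
eigenvalue problem (1). Assume $(\theta, V_k s)$ approximates an eigenpair
$(\lambda, x)$, then the residual vector $r$ is given by

$$r = A V_k s - \theta B V_k s.$$

In case of standard Ritz values, the residual vector $r$ has to be orthogonal to
$V_k$; the harmonic Ritz values approach asks for vectors $r$ to be orthogonal to
$(A - \sigma B)V_k$. Let $W_k$ denote $(A - \sigma B)V_k$, then we have

$$
\begin{aligned}
r &= A V_k s - \theta B V_k s \\
  &= (A - \sigma B)V_k s - (\theta - \sigma) B (A - \sigma B)^{-1} (A - \sigma B) V_k s \qquad (5) \\
  &= W_k s - (\theta - \sigma) B (A - \sigma B)^{-1} W_k s.
\end{aligned}
$$

Here $\nu = 1/(\theta - \sigma)$ is a Ritz value of the matrix
$B(A - \sigma B)^{-1}$ with respect to $W_k$. To obtain eigenvalues in the
neighbourhood of $\sigma$, $\nu$ must lie in the dominant part of the spectrum of
$B(A - \sigma B)^{-1}$. The orthogonalization requirement leads to

$$\nu W_k^* W_k s = W_k^* B V_k s. \qquad (6)$$

To obtain a standard eigenvalue problem we require $W_k^* W_k = I$. By introducing
$C := (A - \sigma B)^* (A - \sigma B)$ this requirement gives

$$W_k^* W_k = V_k^* (A - \sigma B)^* (A - \sigma B) V_k = V_k^* C V_k = I \qquad (7)$$

and we call $V_k$ a $C$-orthonormal matrix.

The new search direction $v_k$ must be $C$-orthonormal to $V_{k-1}$, which implies
that

$$V_{k-1}^* C v_k = 0 \quad \text{and} \quad v_k := \frac{v_k}{\|w_k\|_2}, \qquad (8)$$

where $w_k = (A - \sigma B)v_k$.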

Compared with the original implementation, the harmonic case requires two extra
matrix-vector multiplications and, in addition, extra memory to store an
$N_t$-by-$k$ matrix. The main difference is that the LU decomposition is used as
a preconditioner and not as a shift and invert technique.

3  Numerical  results 

In this section, we show some results obtained on both an 80 processor Cray
T3E [...] in the Netherlands and a 512 processor [...]. [...] best results were
obtained by a message passing implementation using Cray intrinsic SHMEM routines
for data transfer and communication. For more details, we refer to [2].
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Fig.  1.  The  eigenvalue  distribution  of  problem  5 


3.1  Problems 

We have timed five MHD problems of the form (1). The Alfvén spectra of Problems
1, 2 and 3, on the one hand, and Problems 4 and 5, on the other hand, do not
correspond because different MHD equilibria have been used. For more details we
refer to CASTOR [1]. The choices of the acceptance criteria will be explained in
the next section.

1 A small problem of $N = 64$ diagonal blocks of size $n = 48$. We look for
eigenvalues in the neighbourhood of $\sigma = (-0.08, 0.60)$, and stop after 10
eigenpairs have been found with $tol_{sJD} = 10^{-[\ldots]}$ and
$tol_{hJD} = 10^{-[\ldots]}$. The experiments have been performed on $p = 8$
processors.

2 The size of this problem is four times as big as that of the previous problem:
$N = 128$ and $n = 96$. Again, we look for eigenvalues in the neighbourhood
of $\sigma = (-0.08, 0.60)$, and stop after 10 eigenpairs have been found with
$tol_{sJD} = 10^{-[\ldots]}$ and $tol_{hJD} = 10^{-[\ldots]}$. The experiments
have been performed on $p = 8$ processors.

3 The same as Problem 2, but performed on $p = 32$ processors.

4 The size of this large problem is: $N = 256$ and $n = 256$. We took
$\sigma = (-0.15, 0.15)$ and look for $N_{ev} = 12$ eigenpairs with
$tol_{sJD} = 10^{-[\ldots]}$ and $tol_{hJD} = 10^{-[\ldots]}$. The experiments
are performed on $p = 128$ processors.

5 The size of this very large problem is: $N = 4096$ and $n = 64$. We took
$\sigma = (-0.10, 0.23)$, leading to another branch in the Alfvén spectrum. Now,
we look for $N_{ev} = 20$ eigenpairs with $tol_{sJD} = 10^{-[\ldots]}$ and
$tol_{hJD} = 10^{-[\ldots]}$. For this problem a slightly different acceptance
criterion has been applied:

$$\|r\|_2 < tol_{hJD} \cdot \left|\sigma + \tfrac{1}{\nu}\right| \cdot \|u\|_2. \qquad (9)$$

For the harmonic case, the 2-norm of $u$ can be very large, about
$10^{[\ldots]}$, so the results can be compared with
$tol_{hJD} = 10^{-[\ldots]}$. At present, we prefer to control the residue as
described in Section 3.2. Figure 1 shows the distribution of 20 eigenvalues in
the neighbourhood of $\sigma = (-0.10, 0.23)$.
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3.2  Acceptance  criterion 


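For the standard approach we accept an eigenpair $(\sigma + \frac{1}{\theta}, u)$
if the residual vector satisfies:

$$\|r\|_2 = \|(Q - \theta I)u\|_2 < tol_{sJD} \cdot |\theta|, \quad \text{with } \|u\|_2 = 1 \qquad (10)$$

and for the harmonic approach we require:

$$\|r\|_2 = \left\|\left(A - (\sigma + \tfrac{1}{\nu})B\right)u\right\|_2 < tol_{hJD} \cdot \left|\sigma + \tfrac{1}{\nu}\right|, \quad \text{with } \|u\|_C = 1. \qquad (11)$$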

To compare both eigenvalue solvers it is not advisable to choose the tolerance
parameters $tol_{sJD}$ and $tol_{hJD}$ in (10) and (11), respectively, equal to
each other. There are two reasons to take different values: firstly, within the
same number of iterations the standard approach will result in more eigenpair
solutions that satisfy (10) than solutions that satisfy (11). Secondly, if we
compute for each accepted eigenpair $(\lambda, u)$ the true normalized residue
$\gamma$ defined by

$$\gamma := \frac{\|(A - \lambda B)u\|_2}{|\lambda| \cdot \|u\|_2} \qquad (12)$$

then we see that the harmonic approach leads to much smaller $\gamma$ values.
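The normalized residue of (12) is straightforward to evaluate once an eigenpair has been accepted. The sketch below uses small dense matrices stored as nested lists. This representation and the function names are assumptions for illustration only; the paper works with distributed block tridiagonal matrices.

```python
import math


def matvec(M, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def norm2(x):
    return math.sqrt(sum(abs(c) ** 2 for c in x))


def normalized_residue(A, B, lam, u):
    """gamma = ||(A - lam * B) u||_2 / (|lam| * ||u||_2)."""
    Au = matvec(A, u)
    Bu = matvec(B, u)
    r = [a - lam * b for a, b in zip(Au, Bu)]
    return norm2(r) / (abs(lam) * norm2(u))


# Hypothetical 2-by-2 pencil with the exact eigenpair (1 + 1j, [1, 0]).
A = [[2 + 2j, 0], [0, 3]]
B = [[2, 0], [0, 1]]
gamma_exact = normalized_residue(A, B, 1 + 1j, [1, 0])
gamma_perturbed = normalized_residue(A, B, 1 + 1j, [1, 0.1])
```

For the exact eigenpair the residue vanishes, while the perturbed vector gives a strictly positive value. Because $\gamma$ is scaled by $\|u\|_2$, it does not depend on how $u$ is normalized.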

In Figure 2, the convergence behaviour of both the standard and harmonic
approach is displayed, with and without restarts. A $\circ$ indicates that the
eigenpair satisfies (10) or (11), a $\times$ denotes the $\gamma$ value. We
observe that the accuracy for the eigenpairs achieved by means of harmonic Ritz
values is better than suggested by $tol_{hJD}$. On the other hand, $tol_{sJD}$
seems to be too optimistic about the accuracy compared to the $\gamma$ values
shown in Figure 2. In our experiments we took $tol_{sJD} = 10^{-[\ldots]}$ and
$tol_{hJD} = 10^{-[\ldots]}$ [...]. It is not yet clear to us how these
parameters depend on the problem size or the choice of the target.


3.3  Restarting  strategy 

The algorithm has two parameters that control the size of the projected system:
$k_{min}$ and $m$. During each restart, the $k_{min}$ eigenvalues with maximal
norm that are not included in the set of accepted eigenvalues, and that
correspond to the $k_{min}$ most promising search directions, are maintained.
Moreover, since an implicit deflation technique is applied in our
implementation, the $n_{ev}$ eigenpairs found so far are kept in the system too.
The maximum size $m$ should be larger than $k_{min} + N_{ev}$, where $N_{ev}$
denotes the number of eigenvalues we are looking for. The influence of several
$(k_{min}, m)$ parameter combinations on both the parallel performance and
convergence behaviour is studied.


3.4  Timing results of $(k_{min}, m)$ parameter combinations

For each experiment we take $m$ constant and for $k_{min}$ we choose the values
$5, 10, \ldots, m - N_{ev}$. In Figures 4, 5, 6 and 7, the results of a single
$m$ value have been connected by a dashed or dotted line. Experiments with
several $m$ values
Fig. 2. The two upper plots show results for Problem 4 using standard Ritz
values, the lower two for the same problem using harmonic Ritz values. The first
and third plots show the convergence behaviour of Jacobi-Davidson restarting each
time the size of the projected system reaches $m = 37$, where $k_{min} = 25$ and
$k_{min} = 20$, respectively. The second and fourth plots demonstrate the
convergence in case of no restarts. The process ended when $N_{ev} = 12$
eigenvalues were found. It may happen that two eigenvalues are found within the
same iteration step.


have been performed. In the plots we only show the most interesting $m$ values;
$m$ reaches its maximum if $N_{ev}$ eigenpairs were found without using a
restart. In the pictures this is indicated by a solid horizontal line, which is
of course independent of $k_{min}$. If the number of iterations equals 80 and
fewer than $N_{ev}$ eigenpairs have been found, we consider the result as
negative. This implies that, although the execution time is low, this experiment
cannot be a candidate for the best $(k_{min}, m)$ combination.

Before  we  describe  the  experiments  illustrated  by  Figures  4,  5,  6  and  7  we 
make  some  general  remarks: 

-  We observed that if a $(k_{min}, m)$ parameter combination is optimal on $p$
processors, it is optimal on $q$ processors too, with $q \neq p$.

-  For $k_{min}$ small, for instance $k_{min} = 5$ or 10, probably too much
information is thrown away, leading to a considerable increase in iteration
steps.

-  For $k_{min}$ large, the number of restarts will be large at the end of the
process; suppose that in the extreme case, $k_{min} = m - N_{ev}$, already
$N_{ev} - 1$ eigenpairs have been found, then after a restart $k$ becomes
$k_{min} + N_{ev} - 1 = m - 1$. In other words, each step will require a restart.
In Figure 3, the number of
Fig. 3. The number of restarts needed to compute $N_{ev}$ eigenvalues of Problem 2.
Results are shown for different $m$ values ($m$ = 20, 25, 30, 35, 40 and 45), each
drawn with its own marker and line style.


restarts  is  displayed  corresponding  to  the  results  of  Problem  2  obtained  with 
harmonic  Ritz  values. 
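The extreme case in the last remark above can be checked with a few lines of arithmetic. The sketch below assumes a simplified model that is not taken from the authors' code: the projected system grows by one vector per iteration, a restart is triggered as soon as its size reaches m, and a restart keeps $k_{\min}$ vectors plus the eigenpairs already found. The function name is hypothetical.

```python
def steps_between_restarts(m, k_min, found):
    """Iterations available after a restart before the next one is due.

    Assumed model: after a restart the projected system has size
    k_min + found, it grows by one per iteration, and a restart is
    triggered when its size reaches m.
    """
    k_after_restart = k_min + found
    return m - k_after_restart


# Extreme case from the text: k_min = m - N_ev with N_ev - 1 eigenpairs found.
m, n_ev = 37, 12
print(steps_between_restarts(m, m - n_ev, n_ev - 1))  # 1: every step restarts
# A smaller k_min leaves more room between two restarts.
print(steps_between_restarts(m, 10, n_ev - 1))  # 16
```

Under this model a large $k_{\min}$ buys more retained information at the price of a restart at almost every step near the end of the process.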

The number of iterations is almost independent of the number of processors
involved. It may happen that an increase of the number of processors causes
a decrease by one or two iterations under the same conditions, because the
LU decomposition becomes more accurate if the number of cyclic reduction
steps increases at the cost of the domain decomposition part.

The first example (Figure 4) explicitly shows that the restarting technique
can help to reduce the wall clock time for both the standard and harmonic
method. The minimum number of iterations to compute 10 eigenvalues in the
neighborhood of $\sigma$ is achieved in case of no restarts, viz. 53 for the standard case
and 51 for the harmonic case. The least time to compute 10 eigenvalues is attained
for $k_{\min} = 15$ and $m = 30, 35$, but $k_{\min} = 10$ with $m = 30, 35$ and $m = 40$
with $k_{\min} = 15, 20, 25$ also lead to a reduction in wall clock time of about 15%. The
harmonic approach leads to comparable results: for $(k_{\min}, m) = (15, 30{:}35)$,
but also for $(k_{\min}, m) = (10, 30{:}35)$ and $(k_{\min}, m) = (15{:}25, 40)$, a reasonable
reduction in time is achieved. The score for $k_{\min} = 5$ in combination with
$m = 35$ is striking: the unexpectedly small number of iterations in combination
with a small $k_{\min}$ results in a fast time.

The plots in Figure 5 with the timing results for the Jacobi-Davidson process
for Problem 2 give a totally different view. Restarting brings no benefit at all
here, although the numbers of iterations correspond quite well with those
of Problem 1. This can be explained as follows: the size of the projected system
$k$ is proportionally much smaller compared to $N_t/p$ than in case of Problem
1; both the block size and the number of diagonal blocks are twice as big. For
Problem 1 the sequential part amounts to 45% and 36% of the total wall clock time,
respectively, for the standard and harmonic Ritz values. For Problem 2 these
values are 10.5% and 8%, respectively. These percentages hold for the most
expensive sequential case of no restarts. The increase of JD iterations due to
several restarts cannot be compensated by a reduction of serial time by keeping
the projected system small.

Fig. 4. The upper pictures show results for Problem 1 using standard Ritz values. The lower
pictures show results for the same problem with harmonic Ritz values. Results are shown
for different m values: m = 20 (▽ ··· line), m = 25 (+ -· line), m = 30 (o -- line),
m = 35 (x ··· line), m = 40 (> -· line), m = 45 (□ -- line), m = 50 (△ -· line).
The solid lines give the value for no restart.

When we increase the number of active processors by a factor 4, as is done in
Problem 3 (see Figure 6), we observe that again a reduction in wall clock time
can be achieved by using a well-chosen $(k_{\min}, m)$ combination. The numbers
of iterations differ slightly from those given in Figure 5, but the pictures with
the Jacobi-Davidson times look similar to those in Figure 5. If we had
enlarged the problem size by a factor of 4 and left the block size unchanged, we
might expect execution times as in Figure 5.

For Problem 4, the limit of 80 iterations seems to be very critical. The right-hand
plots of Figure 7 demonstrate that the number of iterations does not decrease
monotonically when $k_{\min}$ increases for a fixed value of $m$, as holds for the
previous problems. Moreover, it may happen that for some $(k_{\min}, m)$ combination
the limit on the number of JD iterations is too strict, while for both a smaller and
a larger $k_{\min}$ value the desired $N_{ev}$ eigenpairs were easily found. In the left-hand plots
only those results are included which generate 12 eigenvalues within 80 iterations.
Apparently, for the standard case with $m = 57$ and $30 < k_{\min} < 45$, even
fewer iterations are required than in case of no restarts. Of course, this leads
to a time which is far better than for the no-restart case. For the harmonic approach
the behavior of the number of JD steps is less obvious, but also here the
monotonicity is lost. Execution times become unpredictable and the conclusion
must be that it is better not to restart.

Fig. 5. The upper pictures show results for Problem 2 using standard Ritz values. The lower
pictures show results for the same problem with harmonic Ritz values. Results are shown
for different m values: m = 20 (▽ ··· line), m = 25 (+ -· line), m = 30 (o -- line),
m = 35 (x ··· line), m = 40 (> -· line), m = 45 (□ -- line). The solid lines give the
value for no restart.

Fig. 6. The left pictures show results for Problem 3 using standard Ritz values. The right
pictures show results for the same problem with harmonic Ritz values. Results are shown
for different m values: m = 20 (▽ ··· line), m = 25 (+ -· line), m = 30 (o -- line),
m = 35 (x ··· line), m = 40 (> -· line), m = 45 (□ -- line). The solid lines give the
value for no restart.

Fig. 7. The upper pictures show results for Problem 4 using standard Ritz values. The lower
pictures show results for the same problem with harmonic Ritz values. Results are shown for
different m values: m = 37 (x ··· line), m = 42 (> -· line), m = 47 (□ -- line),
m = 52 (▽ -· line), m = 57 (+ -· line), m = 62 (o -- line). The solid lines give the
value for no restart.

3.5  Parallel  execution  timing  results 

Table 1 shows the execution times of several parts of the Jacobi-Davidson
algorithm on the Cray T3E; the numbers in parentheses show the Gflop-rates. We
took

=  20;  tolsjD  =  10“®;  toluD  =  10~^  k„^in  =  10;  m  =  SO+N,,,-,  itsoL  =  0. 

The  number  of  eigenvalues  found  slightly  depends  on  the  number  of  processors 
involved:  about  11  for  the  standard  and  13  for  the  harmonic  approach  within  80 
iterations. 

The construction of L and U is a very time-consuming part of the algorithm.
However, with a well-chosen target $\sigma$, ten up to twenty eigenvalues can be found
within 80 iterations. Hence, the lifetime of an (L, U) pair is about 80 iterations.
On account of the cyclic reduction part of the LU factorization, a process that
starts on all processors, while at each step half of the active processors become
idle, we may not expect linear speed-up. The fact that the parallel performance
of DDCR is quite good is caused by the domain decomposition part of the LU.
For more details we refer to [2, 5].

Table 1. Wall clock times in seconds for the standard and harmonic Ritz approach,
N = 4096, n = 64.

    p     Preprocessing      Time          Time          Triangular
                             standard JD   harmonic JD   solves
    32    7.90  (6.75)       64.59         88.61         25.56  (2.08)
    64    4.08  (13.21)      31.70         43.78         13.28  (4.02)
    128   2.19  (24.78)      15.07         21.33          7.28  (7.36)
    256   1.27  (42.69)       8.55         11.48          4.36  (12.29)
    512   0.84  (64.65)       5.64          7.02          3.01  (17.81)

About 40% of the execution time is spent on the computation of the LU
factorization (in Table 1 'Preprocessing'), which does not depend on the number
of processors. The storage demands for Problem 5 are so large that at least
the memories of 32 processors are necessary. DDCR is an order $O(Nn^3)$ process
performed by Level 3 BLAS and it needs little communication: only sub- and
superdiagonal blocks of size n-by-n must be transferred. As a consequence, for
the construction of L and U the communication time can be neglected, also due
to the fast communication between processors on the Cray T3E. The Gflop-rates
attained for the construction of the LU are impressively high, just like its parallel
speed-up.

The application of L and U, consisting of two triangular solves, is the most
expensive component of the JD process after preprocessing. It parallelizes well,
but its speed is much lower, because it is built up of Level 2 BLAS operations.
The wall clock times for standard and harmonic JD are given including
the time spent on the triangular solves. Obviously, a harmonic iteration step is
more expensive than a standard step, but the overhead becomes less when more
processors are used, because the extra operations parallelize very well.
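The scaling statements above can be recomputed from the wall clock times in Table 1. The following sketch only reuses those published numbers; the variable and function names are illustrative, and speed-up is measured relative to p = 32 (the smallest configuration in the table), so the ideal value on 512 processors is 16.

```python
# Wall clock times in seconds, copied from Table 1 (N = 4096, n = 64).
procs = [32, 64, 128, 256, 512]
preprocessing = [7.90, 4.08, 2.19, 1.27, 0.84]
standard_jd = [64.59, 31.70, 15.07, 8.55, 5.64]
harmonic_jd = [88.61, 43.78, 21.33, 11.48, 7.02]
triangular = [25.56, 13.28, 7.28, 4.36, 3.01]


def speedup(times):
    """Speed-up relative to the smallest processor count (p = 32)."""
    return [times[0] / t for t in times]


# Extra time of the harmonic variant over the standard one.
overhead = [h - s for h, s in zip(harmonic_jd, standard_jd)]

for name, times in [("preprocessing", preprocessing),
                    ("triangular solves", triangular),
                    ("harmonic overhead", overhead)]:
    print(name, [round(s, 2) for s in speedup(times)])
```

From 32 to 512 processors the preprocessing time drops by a factor of about 9.4 and the triangular solves by about 8.5, while the extra time of the harmonic variant shrinks from roughly 24 s to roughly 1.4 s, which is in line with the remark that the extra operations parallelize very well.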

4  Conclusions 

We have examined the convergence behaviour of two Jacobi-Davidson variants,
one using standard Ritz values, the other one harmonic Ritz values. For the
kind of eigenvalue problems we are interested in, arising from
magnetohydrodynamics, both methods converge very fast and parallelize pretty well. With
tolsjD  =  10  ®  and  tol^jo  =  10~^  in  the  acceptance  criteria  (10)  and  (11), 
respectively, both variants give about the same number of eigenpairs. The
harmonic variant is about 20% more expensive, but results in more accurate
eigenpairs. With a well-chosen target, ten up to twenty eigenvalues can be found. Even
for very large problems, $N_t = 65,536$ and $N_t = 262,144$, we obtain more than
10 sufficiently accurate eigenpairs in a few seconds.

Special attention has been paid to a restarting technique. The $(k_{\min}, m)$
parameter combination prescribes the amount of information that remains in
the system after a restart and the maximum size of the projected system. In this
paper we have demonstrated that $k_{\min}$ should not be too small, because then too
much information gets lost. On the other hand, too large $k_{\min}$ values lead to
many restarts and become expensive in execution time. In general, the number
of iterations decreases when $m$ increases. It depends on the $N_t/p$ value, as we
have shown, whether restarts lead to a reduction in the wall clock time for the
Jacobi-Davidson process.

Abstract. This paper shows how the symmetric eigenproblem, which
is the computationally most demanding part of numerous scientific and
industrial applications, can be solved much more efficiently than by using
algorithms currently implemented in Lapack routines.

The main techniques used in the algorithm presented in this paper are
(i) sophisticated blocking in the tridiagonalization, which leads to a
two-sweep algorithm; and (ii) the computation of the eigenvectors of a band
matrix instead of a tridiagonal matrix.

This new algorithm improves the locality of data references and leads to
a significant improvement in the floating-point performance of symmetric
eigensolvers on modern computer systems. Speedup factors of up to four
(depending  on  the  computer  architecture  and  the  matrix  size)  have  been 
observed. 


Keywords:  Numerical  Linear  Algebra,  Symmetric  Eigenproblem,  Tridiagonalization, 
Performance  Oriented  Numerical  Algorithm,  Blocked  Algorithm 

1  Introduction 

Reducing a dense symmetric matrix A to tridiagonal form T is an important
preprocessing step in the solution of the symmetric eigenproblem. Lapack
(Anderson et al. [1]) provides a blocked tridiagonalization routine whose memory
reference patterns are not optimal on modern computer architectures. In this
Lapack routine a significant part of the computation is performed by calls to
Level 2 Blas routines. Unfortunately, Level 2 Blas do not have a ratio of
floating-point operations to data movement that is high enough to enable
efficient reuse of data that reside in cache or local memory (see Table 1). Thus,
software construction based on calls to Level 2 routines is not well suited to
computers with a memory hierarchy and multiprocessor machines.

This work was supported by the Austrian Science Foundation (FWF).

Table 1. Ratio of floating-point operations to data movement for three closely related
operations from the Level 1, 2, and 3 Blas (Dongarra et al. [6])

    Blas      Routine   Memory     Flops    Flops per
                        Accesses            Memory Access
    Level 1   daxpy     3n         2n       2/3
    Level 2   dgemv     n^2        2n^2     2
    Level 3   dgemm     4n^2       2n^3     n/2
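The ratios in Table 1 follow directly from the leading-order operation and memory-access counts. The short sketch below recomputes them for a given n; it assumes the usual counts for daxpy, dgemv and dgemm (3n, n^2 and 4n^2 memory accesses; 2n, 2n^2 and 2n^3 flops), and the function name is only illustrative.

```python
def blas_ratios(n):
    """Flops per memory access for one Level 1, 2 and 3 Blas operation.

    Each entry maps a routine name to (memory accesses, flops), using
    leading-order counts for vectors of length n and n-by-n matrices.
    """
    counts = {
        "daxpy": (3 * n, 2 * n),          # y <- alpha*x + y
        "dgemv": (n * n, 2 * n * n),      # y <- A*x + y
        "dgemm": (4 * n * n, 2 * n ** 3), # C <- A*B + C
    }
    return {name: flops / mem for name, (mem, flops) in counts.items()}


for name, ratio in blas_ratios(1000).items():
    print(f"{name}: {ratio:.2f} flops per memory access")
```

Only the Level 3 ratio grows with n, which is why blocked algorithms that spend their time in dgemm-like kernels can reuse data held in cache.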

Bischof  et  al.  [4,5]  recently  developed  a  general  framework  for  reducing  the 
bandwidth  of  symmetric  matrices,  which  improves  data  locality  and  allows  for 
the  use  of  Level  3  Blas  instead  of  Level  2  Blas. 

In this paper we introduce an important modification to this framework in
case eigenvectors have to be computed, too: accumulating the transformation
information for the eigenvectors incurs an overhead which can outweigh the
benefits of a multisweep reduction (as remarked in Bischof et al. [4,5]). Therefore we
compute the required eigenvectors directly from the intermediate band matrix.
Analyses and experimental results show that our approach to improving memory
access patterns is superior to established algorithms in many cases.

2  Eigensolver  with  Improved  Memory  Access 

A real symmetric n × n matrix A can be factorized as

    $A = V^T B V = Q^T T Q = Z^T \Lambda Z$    (1)

where V, Q, and Z are orthogonal matrices. B is a symmetric band matrix with
band width 2b + 1, T is a symmetric tridiagonal matrix, and $\Lambda$ is a diagonal
matrix whose diagonal elements are the eigenvalues of the (similar) matrices A,
B and T. The column vectors of Z are the eigenvectors of A.

Lapack reduces a given matrix A to tridiagonal form T by applying Householder
similarity transformations. A bisection algorithm¹ is used to compute
selected eigenvalues of T, which are also eigenvalues of A. The eigenvectors of T
are found by inverse iteration on T. These eigenvectors have to be transformed
into the eigenvectors of A using the transformation matrix Q.
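The bisection step mentioned above can be illustrated with a Sturm sequence count on a symmetric tridiagonal matrix. The following is a minimal sketch of the textbook technique, not the Lapack implementation; the function names, the zero-pivot safeguard and the stopping tolerance are assumptions made for this example.

```python
import math


def count_below(d, e, x):
    """Number of eigenvalues smaller than x of the symmetric tridiagonal
    matrix with diagonal d and off-diagonal e (Sturm sequence count)."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:
            q = 1e-30  # avoid a division by zero in the next step
        if q < 0.0:
            count += 1
    return count


def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) found by bisection."""
    n = len(d)
    # Gershgorin discs give an interval that encloses the whole spectrum.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) +
              (abs(e[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid  # at least k + 1 eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Test matrix tridiag(-1, 2, -1) of order 5, whose eigenvalues are known
# in closed form: 2 - 2*cos(j*pi/6), j = 1, ..., 5.
d, e = [2.0] * 5, [-1.0] * 4
for j in range(5):
    exact = 2.0 - 2.0 * math.cos((j + 1) * math.pi / 6.0)
    print(j, kth_eigenvalue(d, e, j), exact)
```

Because each eigenvalue is located independently, only the selected ones need to be computed, which matches the setting of this paper where a subset of the spectrum is wanted.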

¹ If all eigenvalues are required, Lapack also provides other algorithms.

The new method proposed does not compute the tridiagonal matrix T from
A directly, but derives a band matrix B as an intermediate result. This band
reduction can be organized with good data locality, which is critical for high
performance on modern computer architectures. Using a block size b in the first
reduction sweep results in a banded matrix B with a semibandwidth of at least
b. This relationship leads to a tradeoff:

-  If  smaller  values  of  b  are  chosen,  then  a  larger  number  of  elements  of  A 
are  eliminated.  This  decrease  in  the  number  of  non-zero  elements  leads  to  a 
smaller  amount  of  data  to  be  processed  in  later  steps. 

- If larger values of b are chosen, then better performance improvements are
obtained in the first reduction sweep.

Using  appropriate  values  of  b  the  two-sweep  tridiagonal  reduction  achieves 
speedups  of  up  to  ten  as  compared  with  the  Lapack  tridiagonal  reduction  (see 
Gansterer,  Kvasnicka  [7,8]). 

The eigenvectors can be computed by inverse iteration either from A, B, or
T. In numerical experiments inverse iteration on B turned out to be the most
effective. The eigenvectors of A have to be computed from the eigenvectors of B
using the orthogonal transformation matrix V.

The  new  algorithm  includes  two  special  cases: 

1. If V = Q then B = T, and the inverse iteration is performed on the
tridiagonal matrix T. This variant coincides with the Lapack algorithm.

2. If V = I then B = A, and the inverse iteration is performed on the original
matrix A. This variant is to be preferred if only a few eigenvectors have to
be computed and the corresponding eigenvalues are known.


3  The  New  Level  3  Eigensolver 

The blocked Lapack approach puts the main emphasis on accumulating several
elimination steps: b rank-2 updates (each of them a Level 2 Blas operation) are
aggregated to form one rank-2b update and hence one Level 3 Blas operation.
However, this approach does not take into account memory access patterns in
the updating process. We tried to introduce blocking in a much stricter sense,
namely by using blocked access patterns.

Partitioning matrix A into block columns, and further partitioning each
block column into square submatrices, makes it possible to process the block
columns and their submatrices the same way that the single columns and their
elements are processed in the original unblocked method: all blocks² below the
subdiagonal block are eliminated, which leaves a block tridiagonal matrix, i.e., a
band matrix (when considered elementwise). At first sight this matrix has
bandwidth 4b - 1. Further examination shows that the first elimination sweep can be
organized such that the subdiagonal blocks in the block tridiagonal matrix are
of upper triangular form. Hence the band width of B can be reduced to 2b + 1.

² Replace block by element (or elementwise) to get the original, unblocked, algorithm.

There are two characteristics which distinguish the newly developed Level 3
algorithm from standard algorithms (see Fig. 1):

Two-sweep tridiagonalization: The first sweep reduces the matrix A to a
band matrix B. The second sweep reduces B to a tridiagonal matrix T.
No fill-in occurs in the first reduction sweep. In the second sweep, however,
additional operations are needed to remove fill-in.

Inverse iteration on the band matrix: Calculating the eigenvectors of B
avoids the large overhead entailed by the backtransformation of the
eigenvectors of T. On the other hand, the overhead caused by inverse iteration
on B does not outweigh the benefits of this approach.


Fig. 1. Basic concept of the new eigensolver (diagram labels: Eigenvalues,
Eigenvectors of B, Eigenvectors of A)


4  Implementation  Details 

The resulting algorithm has five steps.

1. Reduce the matrix A to the band matrix B.

(a) Compute the transformation

    $V_i^T A V_i = (I - u_b u_b^T) \cdots (I - u_1 u_1^T) \, A \, (I - u_1 u_1^T) \cdots (I - u_b u_b^T)$

for a properly chosen subblock of A. The Householder vectors for the
transformation of the columns of this subblock do not depend on each
other. This independence enables a new way of blocking in Step 1c.

(b) Collect the Householder vectors as column vectors in an n × b matrix Y
and compute the n × b matrix W such that the transformation matrix $V_i$
is represented as $(I + W Y^T)$. Matrix Y can be stored in those parts of
A which have just been eliminated. Matrix W requires separate storage
of order $O(n^2)$; therefore it is overwritten and has to be computed again
in the backtransformation step.

(c) Perform a rank-2b update using the update matrix $(I + W Y^T)$.

(d) Iterate Steps 1a to 1c over the entire matrix A.

2. Reduce the matrix B to the tridiagonal matrix T. Fill-in increases the total
number of operations needed for the tridiagonalization.

3. Compute the desired eigenvalues of the tridiagonal matrix T.

4. Compute the corresponding eigenvectors of B by inverse iteration. The
computation of the eigenvectors of B requires a higher number of operations
than the computation of the eigenvectors of T.

5. Transform the eigenvectors of B to the eigenvectors of the input matrix A.
The update matrices W have to be computed anew. The transformation
matrix V, which occurs in (1), is not computed explicitly.

Variations of the representation of $V_i$ are (see Bischof [3])

- $V_i = (I + W Y^T)$. The matrix Y holds a set of Householder vectors whereas
the matrix W is computed explicitly. This version is actually used in the
current implementation.

- $V_i = (I - G G^T)$ (see Schreiber, Parlett [9]). This version needs higher effort
for computing G as well as for coding. However, memory for storing W and
the redundant effort to compute W twice are saved.

- $V_i = (I - Y U Y^T)$ (see Schreiber, Van Loan [10]) with a b × b triangular
matrix U. The storage requirement for U is nearly negligible.
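The first of these representations can be checked numerically on a small example. The sketch below assumes reflectors of the form $H_j = I - \beta_j v_j v_j^T$ with $\beta_j = 2 / (v_j^T v_j)$ and the sign convention $H_1 \cdots H_b = I + W Y^T$; the helper names and the test vectors are hypothetical and the code is not taken from the SBR toolbox or Lapack.

```python
def matmul(A, B):
    """Plain dense matrix product for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def reflector(v):
    """Explicit Householder matrix H = I - beta v v^T, beta = 2 / (v^T v)."""
    n = len(v)
    beta = 2.0 / sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - beta * v[i] * v[j]
             for j in range(n)] for i in range(n)]


def wy_accumulate(vectors):
    """Columns of W and Y such that H_1 H_2 ... H_b = I + W Y^T."""
    W, Y = [], []
    for v in vectors:
        beta = 2.0 / sum(x * x for x in v)
        # New column z = -beta * (I + W Y^T) v, built from the old columns.
        z = [-beta * x for x in v]
        for w, y in zip(W, Y):
            coeff = -beta * sum(yi * vi for yi, vi in zip(y, v))
            z = [zi + coeff * wi for zi, wi in zip(z, w)]
        W.append(z)
        Y.append(list(v))
    return W, Y


def wy_matrix(W, Y):
    """Form I + W Y^T explicitly (for checking only)."""
    n = len(Y[0])
    return [[(1.0 if i == j else 0.0) +
             sum(w[i] * y[j] for w, y in zip(W, Y))
             for j in range(n)] for i in range(n)]


vectors = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, -1.0, 2.0], [0.0, 0.0, 1.0, 3.0]]
product = reflector(vectors[0])
for v in vectors[1:]:
    product = matmul(product, reflector(v))
W, Y = wy_accumulate(vectors)
blocked = wy_matrix(W, Y)
max_diff = max(abs(p - q) for rp, rq in zip(product, blocked)
               for p, q in zip(rp, rq))
print(max_diff < 1e-12)
```

In a blocked code the product would never be formed explicitly; applying $I + W Y^T$ to a matrix costs two matrix-matrix products, which is where the Level 3 Blas operations come from.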

The new algorithm and its variants make it possible to use Level 3 Blas
in all computations involving the original matrix, in contrast to the Lapack
routine dsyevx. This routine performs 50% of the operations needed for the
tridiagonalization in Level 2 Blas.

5  Complexity 

The total execution time T of our algorithm consists of five parts:

    $T_1 \sim c_1(b)\, n^3$ for reducing the symmetric matrix A to a band matrix B,
    $T_2 \sim c_2(b)\, n^2$ for reducing the band matrix B to a tridiagonal matrix T,
    $T_3 \sim c_3(k)\, n$ for computing k eigenvalues of T,
    $T_4 \sim c_4(b)\, k b^2 n$ for computing the corresponding k eigenvectors of B, and
    $T_5 \sim c_5\, k n^2$ for transforming the eigenvectors of B into the eigenvectors of A.

The parameters $c_1$, $c_2$, and $c_4$ depend on the semibandwidth b. The parameter
$c_1$ decreases in b, whereas $c_2$ and $c_4$ increase in b due to an increasing number
of operations to be performed. The parameter $c_3$ depends on the number of
computed eigenvectors k, whereas $c_5$ is independent of the problem size.

The band reduction step is the only part of the algorithm requiring an $O(n^3)$
effort. Thus, a large semibandwidth b of B seems to be desirable. However, b
should be chosen appropriately not only to speed up the band reduction ($T_1$),
but also to make the tridiagonalization ($T_2$) and the eigenvector computation
($T_4$) as efficient as possible.




For example, on an SGI Power Challenge the calculation of k = 200 eigenvalues
and eigenvectors of a symmetric 2000 × 2000 matrix requires a total execution
time T = 87 s. $T_1$ is 55 s, i.e., 63% of the total time. $T_4$ is 16 s, i.e., 18% of T.

The other steps of the algorithm require an insignificant part of the execution
time.


6  Results 

A first implementation of our algorithm uses routines from the symmetric band
reduction toolbox (SBR; see Bischof et al. [2,4,5]), Eispack routines (Smith
et al. [11]), Lapack routines (Anderson et al. [1]), and some of our own routines.

In numerical experiments we compare the well established Lapack routine
dsyevx with the new algorithm. On an SGI Power Challenge (with an R8000
processor running at 90 MHz), speedup factors of up to 4 were observed (see
Table 2 and Fig. 5).


Table 2. Execution times (in seconds) on an SGI Power Challenge. k = n/10 of the n
eigenvalues and eigenvectors were computed

    n      k     b    Lapack      New        Speedup
                      dsyevx      Method
    500    50    6       1.5 s      2.1 s    0.7
    1000   100   6      16.6 s     12.6 s    1.3
    1500   150   6      74.4 s     39.2 s    1.9
    2000   200   10    239.2 s     89.7 s    2.7
    3000   300   12    945.5 s    286.4 s    3.3
    4000   400   12   2432.4 s    660.5 s    3.7

Fig. 2 shows the normalized computing time $T(n)/n^3$. The significant speedup
of the new algorithm when applied to large problems is striking. This speedup
has nothing to do with complexity, which is nearly identical for both algorithms
(see Fig. 3). The reason for the good performance of the new algorithm is its
significantly improved utilization of the computer's potential peak performance
(see Fig. 4). The deteriorating efficiency of the Lapack routine (due to cache
effects) causes the $O(n^4)$ behavior of its computation time between n = 500 and
n = 2000 (see Fig. 2).
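The normalized times discussed above can be recomputed from the entries of Table 2. The sketch below uses only those published timings; the variable and function names are illustrative.

```python
# (n, Lapack dsyevx time in s, new method time in s), copied from Table 2.
table2 = [
    (500, 1.5, 2.1),
    (1000, 16.6, 12.6),
    (1500, 74.4, 39.2),
    (2000, 239.2, 89.7),
    (3000, 945.5, 286.4),
    (4000, 2432.4, 660.5),
]


def normalized_ns(seconds, n):
    """Computing time divided by n^3, expressed in nanoseconds."""
    return seconds / n ** 3 * 1e9


speedups = [t_old / t_new for _, t_old, t_new in table2]
lapack_ns = [normalized_ns(t_old, n) for n, t_old, _ in table2]
new_ns = [normalized_ns(t_new, n) for n, _, t_new in table2]

for (n, _, _), s, a, b in zip(table2, speedups, lapack_ns, new_ns):
    print(f"n={n:5d}  speedup={s:4.1f}  Lapack={a:5.1f} ns  new={b:5.1f} ns")
```

For these data the normalized Lapack time grows from about 12 ns at n = 500 to about 38 ns at n = 4000, whereas the new method stays between roughly 10 ns and 13 ns for n ≥ 1000, which is the flat behavior expected of an algorithm whose efficiency does not degrade with problem size.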

The new algorithm shows significant speedups compared to existing algorithms
for large matrices (n > 1500) and a small subset of eigenvalues and
eigenvectors (k = n/10) on all computers at our disposal, including workstations
of DEC, HP, IBM, and SGI.

Choosing the block size b. Experiments show that the optimum block size b,
which equals the smallest possible band width, increases slightly with larger
matrix sizes (see Table 2). A matrix of order 500 requires a b of only 4 or 6
for optimum performance, depending on the architecture (and the size of cache
lines). Band widths of 12 or 16 are optimum on most architectures when the
matrix order is larger than 2000. A hierarchically blocked version which is
currently under development allows increasing the block size without increasing the
band width and therefore leads to even higher performance.

Fig. 2. Normalized computing time $T(n)/n^3$ in nanoseconds, plotted against the order n of the matrix

Fig. 3. Normalized number $op(n)/n^3$ of floating-point operations, plotted against the order n of the matrix

Fig. 4. Floating-point performance (Mflop/s) and efficiency (%), plotted against the order n of the matrix

Fig. 5. Speedup of the new algorithm as compared with the Lapack routine dsyevx.
Performance improvements are all the better if k is only a small percentage of the
n = 3000 eigenvalues and eigenvectors


7  Conclusion 


We presented an algorithm for the computation of selected eigenvalues and
eigenvectors of symmetric matrices which is significantly faster than existing
algorithms. This speedup is achieved by improved blocking in the tridiagonalization
process, which significantly improves data locality.

Lapack algorithms for the symmetric eigenproblem spend up to 80% of their
execution time in Level 2 Blas, which do not perform well on cache-based and
multiprocessor computers. In our algorithm all performance relevant steps make
use of Level 3 Blas.

The price that has to be paid for the improved performance in the
tridiagonalization process is that the eigenvectors cannot be computed from the tridiagonal
matrix because of prohibitive overhead in the backtransformation. They have to
be computed from the intermediate band matrix.

When choosing the block size, a compromise has to be made: larger block
sizes improve the performance of the band reduction. Smaller block sizes have
to be used to reduce the band width and, therefore, to speed up the inverse
iteration on the band matrix. This property makes the new algorithm particularly
attractive (on most architectures) if not all eigenvectors are required.

If the gap between processor speed and memory bandwidth further increases,
our algorithm will be highly competitive also for solving problems where all
eigenvectors are required.

Future Development. Routines dominated by Level 3 Blas operations, like the
eigensolver presented in this paper, have the potential of speeding up almost
linearly on parallel machines. That is why we are currently developing a parallel
version. Another promising possibility of development is the use of hierarchical
blocking (see Ueberhuber [12]).

Abstract. Parallel algorithms for solving almost linear systems are studied.
A non-stationary parallel algorithm based on the multisplitting technique
and its extension to an asynchronous model are considered.
Convergence properties of these methods are studied for M-matrices and
H-matrices. We implemented these algorithms on two distributed memory
multiprocessors, where we studied their performance in relation to
overlapping of the splittings at each iteration.


1  Introduction 

We are interested in the parallel solution of almost linear systems of the form

    $Ax + \Phi(x) = b$,    (1)

where $A = (a_{ij})$ is a real n × n matrix, $x$ and $b$ are n-vectors and $\Phi : \mathbb{R}^n \to \mathbb{R}^n$
is a nonlinear diagonal mapping (i.e., the $i$th component $\Phi_i$ of $\Phi$ is a function
only of $x_i$).

These systems appear in practice from the discretization of differential
equations, which arise in many fields of applications such as trajectory calculation or
the study of oscillatory systems; see, e.g., [3], [5] for some examples.

Assuming that system (1) has a unique solution, White [18] introduced
the parallel nonlinear Gauss-Seidel algorithm, based on both the classical
nonlinear Gauss-Seidel method (see [13]) and the multisplitting technique (see
[12]). Until then, the multisplitting technique had only been used for linear
problems. Recently, in the context of relaxed methods, Bai [1] has presented a
class of algorithms, called parallel nonlinear AOR methods, for solving system
(1). These methods are a generalization of the parallel nonlinear Gauss-Seidel
algorithm [18].

In order to get a good performance of all processors and a good load balance
among processors, in this paper we extend the idea of the non-stationary methods
to the problem of solving the almost linear system (1). This technique was
introduced in [6] for solving linear systems (see also [8], [11]). In a formal way,
let us consider a collection of splittings

$$A = (D - L_{\ell,k,m}) - U_{\ell,k,m}, \quad \ell = 1,2,\ldots, \quad k = 1,2,\ldots,\alpha, \quad m = 1,2,\ldots,q(\ell,k),$$

such that $D = \mathrm{diag}(A)$ is nonsingular and $L_{\ell,k,m}$ are strictly lower
triangular matrices. Note that matrices $U_{\ell,k,m}$ are not generally upper
triangular. Let $E_k$ be nonnegative diagonal matrices such that

$$\sum_{k=1}^{\alpha} E_k = I.$$

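Let us define $r_i : \mathbb{R} \to \mathbb{R}$, $1 \le i \le n$, such that

$$r_i(t) = a_{ii} t + \Phi_i(t), \quad t \in \mathbb{R}, \qquad (2)$$

and suppose that there exists the inverse function of each $r_i$, denoted by $r_i^{-1}$.

Let us consider the operators $P_{\ell,k,m} : \mathbb{R}^n \to \mathbb{R}^n$ such that each of them
maps $x$ into $y$ in the following way

$$y_i = \omega \hat{y}_i + (1 - \omega) x_i, \quad 1 \le i \le n, \quad \omega \in \mathbb{R}, \ \omega \ne 0,$$

$$\text{and} \quad \hat{y}_i = r_i^{-1}(z_i), \quad \text{with} \qquad (3)$$

$$z = \mu L_{\ell,k,m} \hat{y} + (1 - \mu) L_{\ell,k,m} x + U_{\ell,k,m} x + b, \quad \mu \in \mathbb{R}.$$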
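
As an illustration of definition (3), the following Python sketch applies one operator $P_{\ell,k,m}$ to a vector. It is a minimal sketch under stated assumptions: the names `solve_scalar` and `apply_P`, the dense list-of-lists storage, the small test problem, and the use of Newton's method to evaluate $r_i^{-1}$ are choices made here for illustration and are not taken from the paper. Because $L_{\ell,k,m}$ is strictly lower triangular, $\hat{y}$ can be computed one component at a time.

```python
import math


def solve_scalar(a_ii, phi, dphi, z, t0, iters=50, tol=1e-14):
    """Evaluate r^{-1}(z) for r(t) = a_ii*t + phi(t) by Newton's method (assumed solver)."""
    t = t0
    for _ in range(iters):
        step = (a_ii * t + phi(t) - z) / (a_ii + dphi(t))
        t -= step
        if abs(step) < tol:
            break
    return t


def apply_P(diag, L, U, b, phi, dphi, x, omega=1.0, mu=1.0):
    """Map x to y as in (3) for a splitting A = D - L - U, L strictly lower triangular."""
    n = len(x)
    y_hat = [0.0] * n
    for i in range(n):
        # z_i = [mu*L*y_hat + (1 - mu)*L*x + U*x + b]_i; only y_hat[j] with j < i contributes.
        z_i = b[i]
        for j in range(n):
            z_i += L[i][j] * (mu * y_hat[j] + (1.0 - mu) * x[j]) + U[i][j] * x[j]
        y_hat[i] = solve_scalar(diag[i], phi, dphi, z_i, x[i])
    return [omega * y_hat[i] + (1.0 - omega) * x[i] for i in range(n)]


# Hypothetical 3x3 example: A = tridiag(-1, 4, -1), Phi_i(t) = exp(t), and
# b = A*0 + exp(0) = (1, 1, 1), so that the exact solution is x* = 0.
diag = [4.0, 4.0, 4.0]
L = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
U = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
b = [1.0, 1.0, 1.0]

x = [1.0, 1.0, 1.0]
for _ in range(60):
    x = apply_P(diag, L, U, b, math.exp, math.exp, x)
```

With $\omega = \mu = 1$ the sweep is a nonlinear Gauss-Seidel step; repeated application drives `x` towards the zero vector in this example.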

With  this  notation,  the  following  algorithm  describes  a  non-stationary  par¬ 
allel  nonlinear  method  to  solve  system  (1).  This  algorithm  is  based  on  the  AOR- 
type  methods.  It  is  assumed  that  processors  update  their  local  approximation 
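as many times as the non-stationary parameters $q(\ell,k)$ indicate.

Algorithm 1 (Non-stationary Parallel Nonlinear Alg.).

Given the initial vector $x^{(0)}$, and a sequence of numbers of local iterations
$q(\ell,k)$, $\ell = 1,2,\ldots$, $k = 1,2,\ldots,\alpha$

For $\ell = 1,2,\ldots$, until convergence
    In processor $k$, $k = 1$ to $\alpha$
        $x^{\ell,k,0} = x^{(\ell-1)}$
        For $m = 1$ to $q(\ell,k)$
            $x^{\ell,k,m} = P_{\ell,k,m}(x^{\ell,k,m-1})$
    $x^{(\ell)} = \sum_{k=1}^{\alpha} E_k x^{\ell,k,q(\ell,k)}$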

We note that Algorithm 1 extends the nonlinear algorithms introduced in [1]
and [18]. Moreover, Algorithm 1 reduces to Algorithm 2 in [8], when $\Phi(x) = 0$
and for all $\ell = 1,2,\ldots$, $m = 1,2,\ldots,q(\ell,k)$, $L_{\ell,k,m} = L_k$ and $U_{\ell,k,m} = U_k$,
$k = 1,2,\ldots,\alpha$. Here, the formulation of Algorithm 1 allows us to use different
splittings not only in each processor but at each global iteration $\ell$ and/or at each
local iteration $m$. Furthermore, the overlap is allowed as well.
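
The structure of Algorithm 1 can be sketched for the linear special case $\Phi(x) = 0$, $\omega = \mu = 1$, in which $r_i^{-1}(z) = z / a_{ii}$. The Python code below simulates the processors one after another; it is only a sketch under these assumptions. The names `local_sweep` and `algorithm1_linear`, the stopping test on successive iterates, and the 4x4 example are hypothetical and not taken from the paper.

```python
def local_sweep(A, b, S, x):
    """One local update for the splitting tied to index set S (Phi = 0, omega = mu = 1).

    L holds -a_ij for i, j in S with j < i; every other off-diagonal entry goes to U.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(n):
            if j != i:
                in_L = i in S and j in S and j < i
                s -= A[i][j] * (y[j] if in_L else x[j])
        y[i] = s / A[i][i]
    return y


def algorithm1_linear(A, b, sets, weights, q, x0, tol=1e-12, max_outer=1000):
    """Sequential simulation of the synchronous non-stationary iteration."""
    n = len(x0)
    x = list(x0)
    for outer in range(1, max_outer + 1):
        new_x = [0.0] * n
        for k, S in enumerate(sets):          # "In processor k"
            xk = list(x)
            for _ in range(q[k]):             # q(l, k) local iterations
                xk = local_sweep(A, b, S, xk)
            for i in range(n):                # weighting with the diagonal of E_k
                new_x[i] += weights[k][i] * xk[i]
        change = max(abs(new_x[i] - x[i]) for i in range(n))
        x = new_x
        if change < tol:
            return x, outer
    return x, max_outer


# Hypothetical example: A = tridiag(-1, 4, -1) of order 4, exact solution (1, 1, 1, 1).
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
b = [3.0, 2.0, 2.0, 3.0]
sets = [{0, 1}, {2, 3}]                        # two processors, no overlap
weights = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

solution, outer_iterations = algorithm1_linear(A, b, sets, weights, [2, 2], [0.0] * 4)
```

The list `q` plays the role of the non-stationary parameters; here it is kept constant over the global iterations for simplicity.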

In this algorithm all processors complete their local iterations before updating
the global approximation $x^{(\ell)}$. Thus, this algorithm is synchronous.

To construct an asynchronous version of Algorithm 1 we consider an iterative
scheme on $\mathbb{R}^{\alpha n}$. More precisely, we consider that, at the $\ell$th iteration, processor
$k$ performs the calculations corresponding to its $q(\ell,k)$ splittings, saving the
update vector in $x_k^{(\ell)}$, $k = 1,2,\ldots,\alpha$. Moreover, at each step, processors make
use of the most recent vectors computed by the other processors, which are
previously weighted with the matrices $E_k$, $k = 1,2,\ldots,\alpha$.

In a formal way, let us define the sets $J_\ell \subseteq \{1,2,\ldots,\alpha\}$, $\ell = 1,2,\ldots$, as
$k \in J_\ell$ if the $k$th part of the iteration vector is computed at the $\ell$th step.
The superscripts $r(\ell,k)$ denote the iteration number in which the processor $k$
computed the vector used at the beginning of the $\ell$th iteration.

As is customary in the description and analysis of asynchronous algorithms
(see e.g., [2], [4]), we always assume that the superscripts $r(\ell,k)$ and the sets $J_\ell$
satisfy the following conditions

$$r(\ell,k) < \ell \ \text{ for all } k = 1,2,\ldots,\alpha, \ \ell = 1,2,\ldots \qquad (4)$$

$$\lim_{\ell \to \infty} r(\ell,k) = \infty \ \text{ for all } k = 1,2,\ldots,\alpha. \qquad (5)$$

$$\text{The set } \{\ell \mid k \in J_\ell\} \text{ is unbounded for all } k = 1,2,\ldots,\alpha. \qquad (6)$$

Let us consider the operators $P_{\ell,k,m}$ used in Algorithm 1. With this notation,
the asynchronous counterpart of that algorithm corresponds to the following
algorithm.


Algorithm 2 (Async. Non-stationary Parallel Nonlinear Alg.).
Given the initial vectors $x_k^{(0)}$, $k = 1,2,\ldots,\alpha$, and a sequence of numbers of local
iterations $q(\ell,k)$, $\ell = 1,2,\ldots$, $k = 1,2,\ldots,\alpha$

For $\ell = 1,2,\ldots$, until convergence

$$x_k^{(\ell)} = \begin{cases} x_k^{(\ell-1)} & \text{if } k \notin J_\ell, \\ P_{\ell,k,q(\ell,k)} \cdot \ldots \cdot P_{\ell,k,2} \cdot P_{\ell,k,1} \left( \sum_{j=1}^{\alpha} E_j x_j^{(r(\ell,j))} \right) & \text{if } k \in J_\ell. \end{cases} \qquad (7)$$

Note that Algorithm 2 computes iterate vectors of size $\alpha n$, while it only uses
$n$-vectors to perform the updates. For that reason, from the experimental point
of view, we can consider that the sequence of iterate vectors is made up of the
$n$-vectors $\sum_{j=1}^{\alpha} E_j x_j^{(r(\ell,j))}$, $\ell = 1,2,\ldots$. Another consequence of what has
been mentioned above is that only components of the vectors $x_k^{(\ell)}$ corresponding
to nonzero diagonal entries of the matrix $E_k$ need to be computed. Then, the
local storage is of order $n$ and not $\alpha n$.

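In order to rewrite the asynchronous iteration (7) more clearly, we define the
operators $G^{(\ell)} = (G_1^{(\ell)}, \ldots, G_\alpha^{(\ell)})$, with $G_k^{(\ell)} : \mathbb{R}^{\alpha n} \to \mathbb{R}^n$ such that, if $y \in \mathbb{R}^{\alpha n}$,

$$G_k^{(\ell)}(y) = P_{\ell,k,q(\ell,k)} \cdot \ldots \cdot P_{\ell,k,2} \cdot P_{\ell,k,1}(Qy), \quad k = 1,2,\ldots,\alpha,$$

where

$$Q = [E_1, \ldots, E_k, \ldots, E_\alpha] \in \mathbb{R}^{n \times \alpha n}. \qquad (8)$$

Then, iteration (7) can be rewritten as the following iteration

$$x_k^{(\ell)} = \begin{cases} x_k^{(\ell-1)} & \text{if } k \notin J_\ell, \\ G_k^{(\ell)}\left( x_1^{(r(\ell,1))}, \ldots, x_\alpha^{(r(\ell,\alpha))} \right) & \text{if } k \in J_\ell. \end{cases} \qquad (9)$$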


In Section 2, we study the convergence properties of the above algorithms
when the matrix in question is either an M-matrix or an H-matrix. The last section
contains computational results which illustrate the behavior of these algorithms
on two distributed multiprocessors. In the rest of this section we introduce some
notation, definitions and preliminary results.

We say that a vector $x \in \mathbb{R}^n$ is nonnegative (positive), denoted $x \ge 0$ ($x > 0$),
if all its entries are nonnegative (positive). Similarly, if $x, y \in \mathbb{R}^n$, $x \ge y$ ($x > y$)
means that $x - y \ge 0$ ($x - y > 0$). Given a vector $x \in \mathbb{R}^n$, $|x|$ denotes the vector
whose components are the absolute values of the corresponding components of
$x$. These definitions carry over immediately to matrices.

A nonsingular matrix $A$ is said to be an M-matrix if it has non-positive
off-diagonal entries and it is monotone, i.e., $A^{-1} \ge O$; see e.g., Berman and
Plemmons [3] or Varga [17]. Given a matrix $A = (a_{ij}) \in \mathbb{R}^{n \times n}$, its comparison
matrix is defined by $\langle A \rangle = (\alpha_{ij})$, $\alpha_{ii} = |a_{ii}|$, $\alpha_{ij} = -|a_{ij}|$, $i \ne j$. $A$ is said to
be an H-matrix if $\langle A \rangle$ is a nonsingular M-matrix.
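
The comparison matrix can be formed entry by entry, and a simple numerical check of the H-matrix property is possible through the spectral radius of $|D|^{-1}|B|$, which is less than 1 for an H-matrix (as noted later via [17, Theorem 3.10]). The Python sketch below is illustrative only: the function names and the example matrix are hypothetical, and the bound relies on the Collatz-Wielandt inequality $\rho(J) \le \max_i (Jv)_i / v_i$, valid for a nonnegative matrix $J$ and any positive vector $v$.

```python
def comparison_matrix(A):
    """Return <A>: |a_ii| on the diagonal, -|a_ij| off the diagonal."""
    n = len(A)
    return [[abs(A[i][j]) if i == j else -abs(A[i][j]) for j in range(n)]
            for i in range(n)]


def jacobi_radius_upper_bound(A, iters=200):
    """Upper bound on rho(|D|^{-1}|B|), where A = D - B and D = diag(A).

    The vector v is improved by power iteration on I + J, which keeps v positive;
    the returned Collatz-Wielandt quotient is an upper bound for any positive v.
    """
    n = len(A)
    J = [[0.0 if i == j else abs(A[i][j]) / abs(A[i][i]) for j in range(n)]
         for i in range(n)]

    def times_J(v):
        return [sum(J[i][j] * v[j] for j in range(n)) for i in range(n)]

    v = [1.0] * n
    for _ in range(iters):
        Jv = times_J(v)
        w = [v[i] + Jv[i] for i in range(n)]
        scale = max(w)
        v = [wi / scale for wi in w]
    Jv = times_J(v)
    return max(Jv[i] / v[i] for i in range(n))


# Hypothetical example with mixed signs; its comparison matrix is tridiag(-1, 4, -1).
A = [[4, -1, 0],
     [1, -4, 1],
     [0, -1, 4]]
bound = jacobi_radius_upper_bound(A)
is_h_matrix = bound < 1.0
```

A bound below 1 certifies the H-matrix property; a bound of 1 or more is inconclusive unless the iteration has converged.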


Lemma 1. Let $H^{(1)}, H^{(2)}, \ldots, H^{(\ell)}, \ldots$ be a sequence of nonnegative matrices
in $\mathbb{R}^{n \times n}$. If there exists a real number $0 \le \theta < 1$, and a vector $v > 0$ in $\mathbb{R}^n$,
such that

$$H^{(\ell)} v \le \theta v, \quad \ell = 1,2,\ldots,$$

then $\rho(K_\ell) \le \theta^{\ell} < 1$, where $K_\ell = H^{(\ell)} \cdots H^{(1)}$, and therefore
$\lim_{\ell \to \infty} K_\ell = O$.


Proof. The proof of this lemma can be found, e.g., in [15].

Lemma 2. Let $A = (a_{ij}) \in \mathbb{R}^{n \times n}$ be an H-matrix and let $\Phi : \mathbb{R}^n \to \mathbb{R}^n$ be a
continuous and diagonal mapping. If $\mathrm{sign}(a_{ii})\,(t - s)\,(\Phi_i(t) - \Phi_i(s)) \ge 0$,
$i = 1,2,\ldots,n$, for all $t, s \in \mathbb{R}$, then the almost linear system (1) has a unique
solution.

Proof. It is essentially the proof of [1, Lemma 2].


2  Convergence 

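In order to analyze the convergence of Algorithm 1 we rewrite it as the following
iteration scheme

$$x^{(\ell)} = \sum_{k=1}^{\alpha} E_k x^{\ell,k,q(\ell,k)}, \quad \ell = 1,2,\ldots, \qquad (10)$$

where $x^{\ell,k,q(\ell,k)}$ is computed, starting from $x^{\ell,k,0} = x^{(\ell-1)}$, according to the iteration

For $m = 1$ to $q(\ell,k)$

$$x_i^{\ell,k,m} = \omega \hat{x}_i^{\ell,k,m} + (1 - \omega) x_i^{\ell,k,m-1}, \quad 1 \le i \le n, \quad \omega \in \mathbb{R}, \ \omega \ne 0, \qquad (11)$$

and $\hat{x}_i^{\ell,k,m}$ is determined by

$$\hat{x}_i^{\ell,k,m} = r_i^{-1}(z_i^{\ell,k,m}), \qquad (12)$$

where $r_i$ is defined in (2) and

$$z^{\ell,k,m} = \mu L_{\ell,k,m} \hat{x}^{\ell,k,m} + (1 - \mu) L_{\ell,k,m} x^{\ell,k,m-1} + U_{\ell,k,m} x^{\ell,k,m-1} + b, \quad \mu \in \mathbb{R}.$$

The following theorem ensures the existence of a unique solution of system
(1) and shows the convergence of scheme (10) (or Algorithm 1) when $A$ is an
H-matrix and $0 \le \mu \le \omega < \frac{2}{1+\rho}$ with $\omega \ne 0$, where $\rho = \rho(|D|^{-1}|B|)$,
$D = \mathrm{diag}(A)$ and $A = D - B$. Note that, from [17, Theorem 3.10], $|D|$ is a
nonsingular matrix and $\rho(|D|^{-1}|B|) < 1$.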

Theorem 1. Let $A = D - L_{\ell,k,m} - U_{\ell,k,m} = D - B$, $\ell = 1,2,\ldots$,
$k = 1,2,\ldots,\alpha$, $m = 1,2,\ldots,q(\ell,k)$, be an H-matrix, where $D = \mathrm{diag}(A)$ and $L_{\ell,k,m}$
are strictly lower triangular matrices. Assume that $|B| = |L_{\ell,k,m}| + |U_{\ell,k,m}|$. Let
$\Phi$ be a continuous and diagonal mapping satisfying

$$\mathrm{sign}(a_{ii})\,(t - s)\,(\Phi_i(t) - \Phi_i(s)) \ge 0, \quad i = 1,2,\ldots,n, \ \text{ for all } t, s \in \mathbb{R}. \qquad (13)$$

If $0 \le \mu \le \omega < \frac{2}{1+\rho}$ with $\omega \ne 0$, where $\rho = \rho(|D|^{-1}|B|)$, and $q(\ell,k) \ge 1$,
$\ell = 1,2,\ldots$, $k = 1,2,\ldots,\alpha$, then the iteration (10) is well-defined and converges
to the unique solution of the almost linear system (1), for every initial vector
$x^{(0)}$.

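Proof. Since $\Phi_i$, $1 \le i \le n$, are continuous mappings satisfying (13), it follows
that each $r_i$ given in (2) is one-to-one and maps $\mathbb{R}$ onto $\mathbb{R}$. Hence, each $r_i$ has
an inverse function defined in all of $\mathbb{R}$ and thus iteration (10) is well-defined.
On the other hand, by Lemma 2, system (1) has a unique solution, denoted
$x^*$. Let $e^{(\ell)} = x^{(\ell)} - x^*$ be the error vector at the $\ell$th iteration of scheme (10).
Then $|e^{(\ell)}| \le \sum_{k=1}^{\alpha} E_k |x^{\ell,k,q(\ell,k)} - x^*|$. Using (13) and reasoning in a similar way
as in the proof of [13, Theorem 13.13], it is easy to prove that
$|a_{ii}|\,|y - \bar{y}| \le |r_i(y) - r_i(\bar{y})|$, for all $y, \bar{y} \in \mathbb{R}$, where $r_i$ is defined in (2).
Therefore, we obtain that $|a_{ii}|\,|r_i^{-1}(z) - r_i^{-1}(\bar{z})| \le |z - \bar{z}|$, for all $z, \bar{z} \in \mathbb{R}$.
Then, from (12) and using the fact that
$x_i^* = r_i^{-1}([\mu L_{\ell,k,m} x^* + (1 - \mu) L_{\ell,k,m} x^* + U_{\ell,k,m} x^* + b]_i)$, we obtain,
for each $i = 1,2,\ldots,n$, writing $z_i^*$ for the argument of $r_i^{-1}$ in this identity,

$$|a_{ii}|\,|\hat{x}_i^{\ell,k,m} - x_i^*| = |a_{ii}|\,|r_i^{-1}(z_i^{\ell,k,m}) - r_i^{-1}(z_i^*)| \le |z_i^{\ell,k,m} - z_i^*|$$

$$= \left| \left[ \mu L_{\ell,k,m} (\hat{x}^{\ell,k,m} - x^*) + (1 - \mu) L_{\ell,k,m} (x^{\ell,k,m-1} - x^*) + U_{\ell,k,m} (x^{\ell,k,m-1} - x^*) \right]_i \right|.$$

Since these inequalities are true for all $i$ we can write

$$|D|\,|\hat{x}^{\ell,k,m} - x^*| \le \left| \mu L_{\ell,k,m} (\hat{x}^{\ell,k,m} - x^*) + (1 - \mu) L_{\ell,k,m} (x^{\ell,k,m-1} - x^*) + U_{\ell,k,m} (x^{\ell,k,m-1} - x^*) \right|.$$

Since $(|D| - \mu |L_{\ell,k,m}|)^{-1} \ge O$, making use of (11) we obtain, after some algebraic
manipulations, that

$$|x^{\ell,k,m} - x^*| \le (|D| - \mu |L_{\ell,k,m}|)^{-1} \left( (\omega - \mu) |L_{\ell,k,m}| + \omega |U_{\ell,k,m}| + |1 - \omega|\,|D| \right) |x^{\ell,k,m-1} - x^*|, \quad m = 1,2,\ldots,q(\ell,k).$$

Therefore, $|x^{\ell,k,q(\ell,k)} - x^*| \le H_k^{(\ell)} |x^{(\ell-1)} - x^*|$, where
$H_k^{(\ell)} = H_{\ell,k,q(\ell,k)} \cdots H_{\ell,k,2} H_{\ell,k,1}$ and

$$H_{\ell,k,m} = (|D| - \mu |L_{\ell,k,m}|)^{-1} \left( (\omega - \mu) |L_{\ell,k,m}| + \omega |U_{\ell,k,m}| + |1 - \omega|\,|D| \right). \qquad (14)$$

Then $|e^{(\ell)}| \le H^{(\ell)} |e^{(\ell-1)}| \le \ldots \le H^{(\ell)} \cdots H^{(1)} |e^{(0)}|$, where
$H^{(\ell)} = \sum_{k=1}^{\alpha} E_k H_k^{(\ell)}$. Since $A$ is an H-matrix, following the proof of
[8, Theorem 4.1] we conclude that, for $0 \le \mu \le \omega < \frac{2}{1+\rho}$ with $\omega \ne 0$, there exist
real constants $0 \le \theta_k < 1$ and a positive vector $v$ such that $H_k^{(\ell)} v \le \theta_k v$. Hence,
setting $\theta = \max_{1 \le k \le \alpha} \theta_k$, we obtain $H^{(\ell)} v \le \theta v$. Then, from Lemma 1 the product
$H^{(\ell)} \cdots H^{(1)}$ tends to the null matrix as $\ell \to \infty$ and thus $\lim_{\ell \to \infty} e^{(\ell)} = 0$.
This completes the proof.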


Next we show the convergence of the asynchronous Algorithm 2 under similar
hypotheses as in the synchronous case.


Theorem 2. Let $A = D - L_{\ell,k,m} - U_{\ell,k,m} = D - B$, $\ell = 1,2,\ldots$,
$k = 1,2,\ldots,\alpha$, $m = 1,2,\ldots,q(\ell,k)$, be an H-matrix, where $D = \mathrm{diag}(A)$ and $L_{\ell,k,m}$
are strictly lower triangular matrices. Assume that $|B| = |L_{\ell,k,m}| + |U_{\ell,k,m}|$. Let
$\Phi$ be a continuous and diagonal mapping satisfying for all $t, s \in \mathbb{R}$

$$\mathrm{sign}(a_{ii})\,(t - s)\,(\Phi_i(t) - \Phi_i(s)) \ge 0, \quad i = 1,2,\ldots,n.$$

Assume further that the sequence $r(\ell,k)$ and the sets $J_\ell$, $k = 1,2,\ldots,\alpha$,
$\ell = 1,2,\ldots$, satisfy conditions (4-6). If $0 \le \mu \le \omega < \frac{2}{1+\rho}$ with $\omega \ne 0$, where
$\rho = \rho(|D|^{-1}|B|)$, and $q(\ell,k) \ge 1$, $\ell = 1,2,\ldots$, $k = 1,2,\ldots,\alpha$, then the asynchronous
Algorithm 2 is well-defined and converges to $(x^{*T}, \ldots, x^{*T})^T \in \mathbb{R}^{\alpha n}$, where
$x^*$ is the unique solution of the almost linear system (1), for all initial vectors
$x_k^{(0)} \in \mathbb{R}^n$, $k = 1,2,\ldots,\alpha$.


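Proof. By Lemma 2, the existence and uniqueness of a solution of system (1)
is guaranteed. From the proof of Theorem 1 it follows that Algorithm 2 is
well-defined. Moreover, there exists a positive vector $v$ and a constant $0 \le \theta < 1$ such
that

$$H_k^{(\ell)} v \le \theta v, \quad k = 1,2,\ldots,\alpha, \quad \ell = 1,2,\ldots \qquad (15)$$

Let us consider $\tilde{v} = (v^T, \ldots, v^T)^T \in \mathbb{R}^{\alpha n}$ and $\tilde{x}^* = (x^{*T}, \ldots, x^{*T})^T$. As
$G_k^{(\ell)}(\tilde{x}^*) = x^*$, then $\tilde{x}^*$ is a fixed point of $G^{(\ell)}$, $\ell = 1,2,\ldots$. Following the
proof of Theorem 1 it is easy to prove that

$$|G_k^{(\ell)}(y) - G_k^{(\ell)}(z)| \le H_k^{(\ell)} Q\,|y - z|, \quad \text{for all } y, z \in \mathbb{R}^{\alpha n},$$

where $Q$ is defined in (8). Then,

$$|G^{(\ell)}(y) - G^{(\ell)}(z)| \le T^{(\ell)} |y - z|, \quad \text{for all } y, z \in \mathbb{R}^{\alpha n},$$

where

$$T^{(\ell)} = \begin{bmatrix} H_1^{(\ell)} Q \\ \vdots \\ H_\alpha^{(\ell)} Q \end{bmatrix} \in \mathbb{R}^{\alpha n \times \alpha n}. \qquad (16)$$

From equations (15) and (16) it follows that

$$T^{(\ell)} \tilde{v} \le \theta \tilde{v}, \quad \ell = 1,2,\ldots \qquad (17)$$

Due to the uniformity assumption (17), we can apply [2, Theorem 1] to our
case in which the operators change with the iteration superscript. Then, the
convergence is shown.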


Note that, in the particular case in which $A$ is an M-matrix, condition (13)
reduces to stating that the mapping $\Phi$ is nondecreasing. Moreover, the condition
$|B| = |L_{\ell,k,m}| + |U_{\ell,k,m}|$ is equivalent to assuming that $L_{\ell,k,m}$ and $U_{\ell,k,m}$ are
nonnegative matrices.


3  Numerical  experiments 

We  have  implemented  the  above  algorithms  on  two  distributed  multiprocessors. 
The  first  platform  is  an  IBM  RS/6000  SP  with  8  nodes.  These  nodes  are  120 
MHz  Power2  Super  Chip  (Thin  SC)  and  they  are  connected  through  a  high 
performance  switch  with  latency  time  of  40  microseconds  and  a  bandwidth  of 
30  to  35  Mbytes  per  second.  The  second  platform  is  an  ethernet  network  of  five 
120  MHz  Pentiums.  The  peak  performance  of  this  network  is  100  Mbytes  per 
second  with  a  bandwidth  around  6.5  Mbytes  per  second.  In  order  to  manage  the 
parallel  environment  we  have  used  the  PVMe  library  of  parallel  routines  for  the 
IBM  RS/6000  SP  and  the  PVM  library  for  the  cluster  of  Pentiums  [9],  [10]. 

In order to illustrate the behavior of the above algorithms, we have considered
the following semilinear elliptic partial differential equation (see e.g., [7], [16],
[18])

$$-(K^1 u_x)_x - (K^2 u_y)_y = -g e^{u}, \quad (x,y) \in \Omega,$$
$$u = x^2 + y^2, \quad (x,y) \in \partial\Omega, \qquad (18)$$

where

$$K^1 = K^1(x,y) = 1 + x^2 + y^2,$$
$$K^2 = K^2(x,y) = 1 + e^x + e^y,$$
$$g = g(x,y) = 2\left(2 + 3x^2 + y^2 + e^x + (1 + y)e^y\right) e^{-x^2 - y^2},$$
$$\Omega = (0,1) \times (0,1).$$

It is well known that this problem has the unique solution $u(x,y) = x^2 + y^2$.
To solve equation (18) using the finite difference method, we consider a grid
in $\Omega$ of $d^2$ nodes equally spaced by $h = \Delta x = \Delta y = \frac{1}{d+1}$. This discretization
yields an almost linear system $Ax + \Phi(x) = b$, where $A$ is a block tridiagonal
symmetric matrix $A = (D_{i-1}, T_i, D_i)_{i=1}^{d}$, where $T_i$ are tridiagonal matrices of
size $d \times d$, $i = 1,2,\ldots,d$, and $D_i$ are $d \times d$ diagonal matrices, $i = 1,\ldots,d-1$;
see e.g., [7].
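
The stated solution $u(x,y) = x^2 + y^2$ can be checked numerically against the coefficients $K^1 = 1 + x^2 + y^2$, $K^2 = 1 + e^x + e^y$ and $g = 2(2 + 3x^2 + y^2 + e^x + (1+y)e^y)e^{-x^2-y^2}$. The Python sketch below evaluates the residual of the differential equation with a flux-form central difference. It is an illustrative check only: the function names, the step `h` and the sample points are choices made here and are not taken from the paper.

```python
import math


def K1(x, y):
    return 1.0 + x * x + y * y


def K2(x, y):
    return 1.0 + math.exp(x) + math.exp(y)


def g(x, y):
    return 2.0 * (2.0 + 3.0 * x * x + y * y + math.exp(x)
                  + (1.0 + y) * math.exp(y)) * math.exp(-x * x - y * y)


def u(x, y):
    return x * x + y * y


def residual(x, y, h=1e-3):
    """Residual of -(K1 u_x)_x - (K2 u_y)_y + g e^u, using fluxes at half points."""
    flux_x = (K1(x + h / 2, y) * (u(x + h, y) - u(x, y))
              - K1(x - h / 2, y) * (u(x, y) - u(x - h, y))) / (h * h)
    flux_y = (K2(x, y + h / 2) * (u(x, y + h) - u(x, y))
              - K2(x, y - h / 2) * (u(x, y) - u(x, y - h))) / (h * h)
    return -flux_x - flux_y + g(x, y) * math.exp(u(x, y))


sample_points = [(0.3, 0.7), (0.5, 0.5), (0.9, 0.1)]
residuals = [residual(x, y) for x, y in sample_points]
```

The residuals are of the order of the truncation error of the difference formula, which supports the form of $g$ given above.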

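Let $S = \{1,2,\ldots,n\}$ and let $n_k$, $k = 1,2,\ldots,\alpha$, be positive integers which
add up to $n$. Consider $S_{k,m}$, $k = 1,2,\ldots,\alpha$, $m = 1,2,\ldots,q(\ell,k)$, subsets of $S$ defined
as

$$S_{k,m} = \{ s^1_{k,m}, s^1_{k,m} + 1, \ldots, s^2_{k,m} \}, \qquad (19)$$

where

$$s^1_{k,m} = \max\Big\{1,\ 1 + \sum_{i<k} n_i - bd - (m-1)d \Big\}, \quad \text{and} \quad s^2_{k,m} = \min\Big\{n,\ \sum_{i \le k} n_i + bd + (m-1)d \Big\}, \qquad (20)$$

with $b$ being a nonnegative integer. Note $S = \bigcup_{k=1}^{\alpha} S_{k,m}$, $m = 1,2,\ldots,q(\ell,k)$.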

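Let us further consider multisplittings of the form
$\{D - L_{\ell,k,m}, U_{\ell,k,m}, E_k\}_{k=1}^{\alpha}$, where $L_{\ell,k,m} = L_{k,m}$ with entries

$$(L_{k,m})_{ij} = \begin{cases} -a_{ij} & \text{if } i, j \in S_{k,m} \text{ and } i > j, \\ 0 & \text{otherwise,} \end{cases} \qquad (21)$$

and $U_{\ell,k,m} = U_{k,m}$, for all $\ell = 1,2,\ldots$. The $n \times n$ nonnegative diagonal matrices
$E_k$, $1 \le k \le \alpha$, are defined such that their $i$th diagonal entry $(E_k)_{ii}$ is calculated
as follows

$$(E_k)_{ii} = \begin{cases} 1 & \text{if } i \in S_{k,1} \text{ and } i \notin S_{j,1},\ j \ne k, \\ 0.5 & \text{if } i \in S_{k,1} \cap S_{k-1,1} \text{ or } i \in S_{k,1} \cap S_{k+1,1}, \\ 0 & \text{if } i \notin S_{k,1}. \end{cases}$$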
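
The index sets and weights can be made concrete with a short Python sketch. It reads (19)-(20) as $S_{k,m} = \{s^1_{k,m}, \ldots, s^2_{k,m}\}$ with $s^1_{k,m} = \max\{1, 1 + \sum_{i<k} n_i - bd - (m-1)d\}$ and $s^2_{k,m} = \min\{n, \sum_{i \le k} n_i + bd + (m-1)d\}$; taking the upper sum over $i \le k$ is an assumption made here. The function names and the small example are hypothetical, and the weights assume that only neighbouring sets $S_{k-1,1}$ and $S_{k+1,1}$ overlap with $S_{k,1}$.

```python
def index_set(k, m, sizes, d, b):
    """S_{k,m} as a range of 1-based indices; k and m are 1-based as in the text."""
    n = sum(sizes)
    first = max(1, 1 + sum(sizes[:k - 1]) - b * d - (m - 1) * d)
    last = min(n, sum(sizes[:k]) + b * d + (m - 1) * d)
    return range(first, last + 1)


def weight_diagonal(k, sizes, d, b):
    """Diagonal of E_k: 1 on indices owned only by k, 0.5 on overlaps, 0 elsewhere."""
    n = sum(sizes)
    alpha = len(sizes)
    own = index_set(k, 1, sizes, d, b)
    left = index_set(k - 1, 1, sizes, d, b) if k > 1 else range(0)
    right = index_set(k + 1, 1, sizes, d, b) if k < alpha else range(0)
    diagonal = []
    for i in range(1, n + 1):
        if i not in own:
            diagonal.append(0.0)
        elif i in left or i in right:
            diagonal.append(0.5)
        else:
            diagonal.append(1.0)
    return diagonal


# Hypothetical example: n = 16, two processors with n_1 = n_2 = 8, block size d = 4,
# overlapping level b = 1, so neighbouring sets share 2*b*d = 8 indices.
sizes = [8, 8]
E1 = weight_diagonal(1, sizes, d=4, b=1)
E2 = weight_diagonal(2, sizes, d=4, b=1)
```

In this example the two weight vectors add up to the vector of ones, which is the condition $\sum_k E_k = I$.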

Experiments were performed with almost linear systems of different orders.
The conclusions were similar for all tested problems. Here we discuss the results
obtained with $d = 64$ and $d = 200$, which give rise to almost linear systems of sizes
4096 and 40000 respectively. In this paper all the times obtained for the parallel
algorithms correspond to REAL times; moreover, they are reported in seconds.
The initial vector used was $x^{(0)} = (1,\ldots,1)^T$. The stopping criterion used for the
almost linear system of size 4096 was [...], where $\|\cdot\|_2$ is the Euclidean
norm and $v$ is the vector whose entries are the values of the exact solution of
(18) at the grid points $(ih, jh)$, $i, j = 1,\ldots,d$. However, for the almost linear system
of size 40000 the convergence criterion was changed to [...].

On the other hand, to solve the one-dimensional nonlinear equation (3) the
Newton method is used. The best results were obtained performing only one
iteration of this method.

Table 1 shows results for the almost linear system of size
4096, setting $\omega = \mu = 1$ in Algorithm 1 and using different multisplittings
depending on the number of processors used ($\alpha$) and on the choice of the values
$n_k$, $1 \le k \le \alpha$. Moreover, in this table, the case in which the splittings do
not change with the local iterations (i.e., $L_{k,m} = L_{k,1}$, $m = 1,2,\ldots,q(\ell,k)$)
is analyzed together with the case in which the splittings change according to
(19) and (21). No overlapping is considered, that is, the integer $b$ in (20) is
taken as zero. It is observed that when the splittings change, the number of
global iterations decreases. Therefore, the communications among processors
are reduced and less execution time is observed.


                                           | Without varying the splittings | Varying the splittings
α | n_k                   | q(ℓ,k)   | It.  | Time Cluster | Time SP2 | It.  | Time Cluster | Time SP2
2 | 2048, 2048            | 1,1      | 4292 | 70.46        | 11.82    | 4292 | 70.46        | 11.82
  |                       | 2,2      | 2161 | 64.21        | 10.53    | 2144 | 59.42        | 10.29
  |                       | 4,4      | 1091 | 58.11        | 10.14    | 1065 | 53.79        | 9.54
  |                       | 8,8      | 552  | 58.74        | 10.36    | 532  | 53.69        | 9.85
  |                       | 3,2      | 1802 | 70.45        | 11.81    | 1786 | 68.72        | 11.89
2 | 1216, 2880            | 1,1      | 4266 | 111.4        | 15.95    | 4266 | 111.4        | 15.95
  |                       | 3,3      | 1435 | 85.84        | 14.11    | 1416 | 83.57        | 13.52
  |                       | 8,8      | 545  | 81.55        | 14.16    | 530  | 70.71        | 12.83
  |                       | 10,9     | 476  | 80.04        | 13.98    | 464  | 69.35        | 12.60
4 | 1024, 1024, 1024, 1024 | 1,1,1,1 | 4391 | 57.82        | 7.73     | 4391 | 57.82        | 7.73
  |                       | 3,3,3,3  | 1499 | 39.50        | 5.98     | 1447 | 37.33        | 5.92
  |                       | 4,4,4,4  | 1113 | 34.94        | 6.01     | 1081 | 35.12        | 5.94
  |                       | 3,4,4,3  | 1206 | 38.79        | 6.46     | 1152 | 36.14        | 6.30
  |                       | 8,8,8,8  | 580  | 40.08        | 6.82     | 536  | 36.57        | 6.64
4 | 1216, 832, 832, 1216  | 1,1,1,1  | 4418 | 62.84        | 7.77     | 4418 | 62.84        | 7.77
  |                       | 3,3,3,3  | 1513 | 44.27        | 6.55     | 1453 | 41.89        | 6.42
  |                       | 4,4,4,4  | 1145 | 42.38        | 6.51     | 1085 | 39.49        | 6.30
  |                       | 3,4,4,3  | 1256 | 36.94        | 5.81     | 1195 | 34.52        | 5.62
  |                       | 4,5,5,4  | 989  | 36.69        | 5.89     | 930  | 33.89        | 5.61

Table 1. Non-stationary synchronous models without overlap. Size of the almost linear system: 4096


One can observe that the number of iterations of the non-stationary algorithms
decreases when the parameters $q(\ell,k)$ are increased. Furthermore, if the
decrease in the number of global iterations outweighs the cost of the additional
local updates, then less execution time is observed. Note that when $q(\ell,k) = 1$ and
the splittings do not change with the local iterations, the method reduces to
the parallel nonlinear Gauss-Seidel method (see [18]) and, as can be seen,
the non-stationary parallel methods are generally better than the parallel
nonlinear Gauss-Seidel method.

On the other hand, it is interesting to compare the parallel results of Table
1 with the results of the well-known one-step Gauss-Seidel Newton method [13].
The latter performs 4196 iterations and the CPU time in the IBM RS/6000 SP
computer was 20.09 seconds. So, we calculated the speed-up taking that method
as the sequential algorithm, that is, we have considered

$$\text{Speed-up} = \frac{\text{CPU time of one-step GS Newton algorithm}}{\text{REAL time of parallel algorithm}}.$$

In this context, it is observed that
we obtain parallel non-stationary algorithms such that processors can achieve
between 84 % and 105 % of efficiency ($\text{Efficiency} = \frac{\text{Speed-up}}{\text{number of processors}}$)
when two processors are used and between 62 % and 89 % of efficiency using
four processors.
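
The speed-up and efficiency figures follow from simple ratios, which the short Python sketch below reproduces for two of the two-processor runs. The function names are hypothetical; the sequential time of 20.09 seconds is the one quoted above, and the parallel times 11.82 and 9.54 seconds are two of the SP2 times reported in Table 1.

```python
def speed_up(sequential_time, parallel_time):
    """Ratio of the sequential CPU time to the parallel REAL time."""
    return sequential_time / parallel_time


def efficiency(sequential_time, parallel_time, processors):
    """Speed-up divided by the number of processors."""
    return speed_up(sequential_time, parallel_time) / processors


SEQUENTIAL_TIME = 20.09  # one-step Gauss-Seidel Newton method, seconds

efficiency_q1 = efficiency(SEQUENTIAL_TIME, 11.82, 2)   # q = (1,1), two processors
efficiency_q4 = efficiency(SEQUENTIAL_TIME, 9.54, 2)    # q = (4,4), varying the splittings
```

An efficiency above 100 % is possible here because the parallel non-stationary method is not the same algorithm as the sequential reference method.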


Table 2. Non-stationary synchronous models with overlap. Size of the almost linear system: 4096


The above conclusions are independent of the computer used. However, the
network of the cluster is very slow compared to the network of the other
computer. [...] the IBM RS/6000 SP computer, there is a significant difference
between these two times in the cluster. In the rest of this section all the
numerical experiments have been run in the IBM RS/6000 SP multiprocessor.

Table 2 illustrates the influence of the overlap according to different choices
of the overlapping level $b = 1,2,3$; see (20). Note that the parameter $b$ indicates
that the splitting assigned to a processor $k$, $k = 1,2,\ldots,\alpha$, has an overlap
of $2b$ blocks (each one of size $d$) with the splittings assigned to the processors
$k - 1$ and $k + 1$. The conclusions are similar to those presented in Table 1.
However, it is observed that while the number of iterations decreases when the
overlap increases, this decrease does not lead to less execution time. This is due to
the increase of the number of operations performed at each processor and the
increase of the communications among processors.

Now, we report in Table 3 results of non-stationary methods for the almost
linear system of size 40000. Moreover, we have calculated for each method the
error $\|x - v\|$. As can be seen, when the non-stationary parameters

Without varying the splittings | Varying the splittings

It. 

Mam 

■ 

E31 

■1 

1,1, 1,1 

40983 

iKUsifca 

IIEIinKKi 

blME 

14961 

611.29 

h.00012 

14799 

11484 

598.31 

11318 

585.85 

0.00010 

^ _ 1 _ 

1,1, 1,1 

lEBHiH 

845.37 

iHl  i  i 

40757 

845.37 

616.90 

0.00025 

0  00012 

14820 

622.18 

iliSS 

14741 

11365 

607.84 

IHwJil 

606.88 

0.00010 

1.1, 1,1, 1,1 

41266 

608.52 

[■nil  — 

lEisma 

608.52 

437.73 

|n*  =  6800,;?  =  1,6 

3, 3, 3,3, 3, 3 

15120 

446.33 

0.00012 

14874 

5, 5, 5, 5.5, 5 

9474 

433.37 

9223 

415.58 

ill  1  ifiKfl 

4,5, 5, 5,5,4 

9622 

9369 

413.59 

ill  1 

1,1. 1,1, 1,1 

40925 

40925 

636.56 

bhiiiiiKiial 

Ilk  =  6800,/;  =  1,6 

3,3,3, 3, 3,3 

14906 

463.43 

14786 

453.39 

0.00015  II 

5, 5, 5, 5, 5, 5 

9318 

440.76 

0.000094 

9183 

434.38 

k  <  5  1 

3,4.4, 4,4, 3 

11679 

449.56 

0.00010 

11548 

445.59 

iSi 

Table 3. Non-stationary synchronous models without and with overlap. Size of the almost linear system: 40000


$q(\ell,k)$ increase the error is reduced and therefore the approximation to the
solution of the semilinear elliptic partial differential equation (18) is better. The
remaining conclusions are similar to those obtained for the almost linear system
of size 4096.

Figure 1 illustrates the influence of the parameters $\omega = \mu \ne 1$ for different
overlapping levels, $b = 0,1,2$, when Algorithm 1, varying the splittings, is used.
These results correspond to the problem of size 4096 when we use four processors
and the multisplitting is defined by $n_k = 1024$, $k = 1,2,3,4$. As can be
observed, in a neighborhood of the optimum relaxation parameter $\omega$ the models
with overlap are better than the corresponding non-overlapped one. We note
that similar results were obtained without varying the splittings.

To finish this section we consider two different implementations of the
asynchronous Algorithm 2. In the first implementation we consider $\alpha$ processors
connected to a host processor, where $\alpha$ is the number of splittings. Thus, we
use $\alpha + 1$ processors. The role of the host processor is to receive, in an
asynchronous way, the approximation computed by other processors, to update the
global approximation and to send it to the corresponding processor.

In the second implementation we use $\alpha$ processors, as many as splittings.
Then, one of the processors, we assume the first one, has to compute the
approximation corresponding to one of the splittings and, moreover, it has to take
the role of host processor. For this purpose, in the process executed by this
processor we have inserted some PVMe calls between the statements which
compute its approximation. This allows us to check if the approximations of some
of the other processors have arrived. In this case the host processor executes the
following tasks,

1. it stops the calculation of its approximation and it receives the approximation
   or approximations sent by other processors,
2. it updates the global approximation,
3. if the stopping criterion is not satisfied, then it sends the updated
   approximation to the corresponding processors, and finally
4. it continues computing its approximation.

Fig. 1. Comparison of synchronous parallel models for different overlapping levels.
Non-stationary parameters $q(\ell,k) = 4$, $\ell = 1,2,\ldots$, $k = 1,2,3,4$.

Note  that  in  this  implementation  there  are  some  waiting  times. 

Figures 2 and 3 illustrate, respectively, the behavior of the above asynchronous
implementations of Algorithm 2. In these figures, the multisplitting
1 makes reference to the one obtained from $n_k = 1024$, $k = 1,2,3,4$, while the
multisplitting 2 corresponds to the values $n_1 = n_4 = 1216$, $n_2 = n_3 = 832$.

Overlap and variation of the splittings are not considered in these figures. The
conclusions were similar to those of the synchronous models whether the overlap
and variation of the splittings were considered or not. However, these asynchronous
implementations have not accelerated the convergence. This is due to
the fact that in the asynchronous implementations of our example the
communications increase compared to the synchronous ones, while the number of
operations performed remains of the same order. This is especially problematic
when a distributed memory multiprocessor is used.

Abstract. A system organized as a Hierarchy of Processor-And-Memory
(HPAM) extends the familiar notion of memory hierarchy by including
processors with different performance in different levels of the hierarchy.
Tasks are assigned to different hierarchy levels according to their degree
of parallelism. This paper studies the spatial locality (with respect to
degree of parallelism) behavior of simulated parallelized benchmarks in
multi-level HPAM systems, and presents an inter-level cache coherence
protocol that supports inclusion and multiple block sizes on an HPAM
architecture. Inter-level miss rates and traffic simulation results show
that the use of multiple data transfer sizes (as opposed to a unique size)
across the HPAM hierarchy allows the reduction of data traffic between
the uppermost levels in the hierarchy while not degrading the miss rate
in the lowest level.


1  Introduction 

The  Hierarchical  Processor-And-Memory  (HPAM)  architecture  [15]  has  been 
proposed  as  a  cost/effective  approach  to  parallel  processing.  The  HPAM  concept 
is  based  on  a  heterogeneous,  hierarchical  organization  of  resources  that  is  similar 
to  conventional  memory  hierarchies.  However,  each  level  of  the  hierarchy  has  not 
only  storage  but  also  processing  capabilities.  Assuming  that  the  top  (i.e.  first) 
level  of  the  hierarchy  is  the  fastest,  any  given  memory  level  is  extended  with 
processors  that  are  slower,  less  expensive  and  in  larger  number  than  those  in  the 
preceding  level.  Figure  1  depicts  a  generic  3-level  HPAM  machine. 

The  mapping  of  an  application  to  an  HPAM  system  is  based  on  the  degrees 
of  parallelism  that  the  application  exhibits  during  its  execution.  Each  level  of 
an  HPAM  hierarchy  handles  portions  of  code  whose  parallelism  degree  is  within 
a  certain  range.  Levels  with  large  number  of  slow  processors  and  large  memory 
capacity  (bottom  levels)  are  responsible  for  the  highly  parallel  fractions  of  an 
application,  whereas  levels  with  small  number  of  fast  processors  and  memories 
are  responsible  for  the  execution  of  sequential  and  moderately  parallel  code. 

An  HPAM  machine  exploits  heterogeneity,  computing-in-memory  and  local¬ 
ity  of  memory  references  with  respect  to  degree  of  parallelism  to  provide  supe¬ 
rior  cost/performance  over  conventional  homogeneous  multiprocessors.  Previous 
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Fig.  1.  Processor  and  memory  organization  of  a  3-level  HPAM 


studies  [14  15]  have  empirically  established  that  applications  exhibit  temporal 
locality  with  respect  to  degree  of  parallelism.  This  paper  extends  these  studies  bv 
empirically  establishing  that  applications  also  exhibit  spatial  data  locality.  Fur- 

HP  A  M multiple  transfer  sizes  across  an 
HPAM  hierarchy  with  more  that  two  levels  in  inter-level  miss  rates  and  traffic 
io  this  end  this  paper  proposes  an  inter-level  coherence  protocol  that  supports 
inclusion  and  multiple  block  sizes  across  the  hierarchy. 

The data locality studies have been performed through execution-driven
simulation of parallelized benchmarks. Several benchmarks from three different
suites (CMU [9], Perfect Club [7] and Spec95 [11]) have been instrumented
to detect do-loop parallelism with the Polaris [12] parallel compiler. The stream
of memory references generated by a benchmark is simulated by a multi-level
memory hierarchy simulator that implements the proposed inter-level coherence
protocol to obtain measurements of data locality, data traffic and invalidation
traffic at each HPAM level.

The rest of this paper is organized as follows. Section 2 introduces the HPAM
machine model and discusses coherence protocols for an HPAM architecture.
Section 3 presents the methodology used to perform the data locality studies and
Section 4 presents simulation results and analysis of the locality behavior of
applications and of the proposed inter-level coherence protocol. Section 5 concludes
the paper.


2  HPAM  Architecture 

A hierarchical processor-and-memory (HPAM) machine is a heterogeneous,
multilevel parallel computer. Each HPAM level contains processors, memory and
an interconnection network. The speed and number of processors, and the latency
and capacity of memories and network, differ between levels in the hierarchy. The
following characteristics hold across the different HPAM levels from top to bottom:
individual processor performance decreases, number of processors increases, and
memory/network latency and capacity increase.

Tasks  are  assigned  to  HPAM  levels  according  to  their  degree  of  parallelism 
(DoP).  Highly  parallel  code  fractions  of  an  application  are  assigned  to  bottom 
levels,  where  large  number  of  processors  and  large  memory  capacity  are  avail¬ 
able,  while  sequential  and  modestly  parallel  fractions  are  assigned  to  top  levels, 
where  a  small  number  of  fast  processors  and  memories  are  available. 

The  HPAM  approach  to  computing-in-memory  bears  similarities  with  IRAM
and  PIM  efforts  [2, 10],  but  it  relies  on  heterogeneity  and  locality  with  respect  to 
degree  of  parallelism  to  build  a  hierarchy  of  processor-and-memory  subsystems 
where  each  subsystem  is  designed  to  be  cost-efficient  in  its  parallelism  range. 
A  massively  parallel  system  implemented  with  dense,  relatively  inexpensive  and 
slow  memory  technology,  can  be  very  efficient  for  highly  parallel  code,  while  a 
tightly-coupled  symmetric  multiprocessor  containing  a  smaller  amount  of  fast, 
expensive  memory,  can  be  very  efficient  for  code  mostly  sequential  or  with  moder¬ 
ate  parallelism.  The  merging  of  these  systems  under  the  HPAM  concept  provides 
a  cost-efficient  solution  for  applications  with  different  levels  of  parallelism. 


2.1  Inter-Level  Coherence  Protocols 


A  distributed  shared-memory  (DSM)  implementation  of  HPAM  is  assumed  in 
this  paper.  Each  level  of  such  shared-memory  HPAM  machine  relies  on  caching 
of  data  from  remote  levels  to  reduce  inter-level  bandwidth  requirements  and  im¬ 
prove  remote  access  latency.  Therefore,  cache  coherence  has  to  be  enforced  both 
inside  an  HPAM  level  and  among  different  levels.  Cache-coherence  solutions  that 
use  a  combination  of  different  protocols  (snoopy  and  directory-based)  have  been 
proposed  and  implemented  [6]  for  homogeneous  DSMs,  and  can  be  reused  in  an
HPAM  context.  However,  an  HPAM  machine  can  take  advantage  of  coherence 
solutions  that  exploit  its  heterogeneous  nature. 

In  this  paper,  the  potential  advantages  of  having  multiple  line  sizes  across  the 
hierarchy  are  studied.  Similar  to  conventional  uniprocessor  memory  hierarchies, 
multiple  line  sizes  in  an  HPAM  context  can  provide  low  miss  rates  in  the  bottom 
levels  of  the  hierarchy  while  not  sacrificing  traffic  and  miss  penalty  in  the  upper 
hierarchy  levels.  In  this  section,  a  coherence  protocol  that  allows  requests  to  be 
generated  by  any  level  of  the  hierarchy  (as  opposed  to  a  conventional  memory 
hierarchy,  where  all  accesses  are  generated  in  the  topmost  level),  and  supports 
both  inclusion  and  multiple  line  sizes,  is  described.  Similar  to  the  MESI  coher¬ 
ence  protocol  [8],  the  protocol  assigns  one  of  four  states  to  each  memory  block 
and  relies  on  invalidations  to  maintain  coherence. 

Let the hierarchical organization have h levels, where level 1 is the top level
and h is the bottom hierarchy level. Let ls_i be the line size (also referred to as
block size) in level i. All data transfers between adjacent levels i and i + 1 have
size ls_i. Let Bj(i) be block i in level j, where i >= 0 and 1 <= j <= h. Assume that
block sizes across the hierarchy satisfy the relation:

Given this alignment, let the sub-blocks in level k, k < j, of a block Bj(i) be
defined as:

$$ B_k\!\left(i\,\frac{ls_j}{ls_k}\right),\; B_k\!\left(i\,\frac{ls_j}{ls_k} + 1\right),\; \ldots,\; B_k\!\left((i+1)\,\frac{ls_j}{ls_k} - 1\right) \qquad (2) $$

and the unique superblock in level l, l > j, be defined as:

$$ B_l\!\left(\left\lfloor i\,\frac{ls_j}{ls_l} \right\rfloor\right) \qquad (3) $$
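The sub-block and superblock relations can be illustrated with a small sketch. It is an assumption-laden illustration rather than the authors' formulation: the function names, the dictionary `ls` of line sizes in bytes, and the byte values themselves are hypothetical, and each line size is assumed to be an integer multiple of the line size of the level above it. Levels are numbered as in Figure 2 (0 is the top level).

```python
def sub_blocks(i, j, k, ls):
    """Indices of the level-k blocks covered by block i of level j.

    Level k must be above level j (smaller blocks); ls[level] is the line
    size in bytes and ls[j] is assumed to be a multiple of ls[k].
    """
    ratio = ls[j] // ls[k]
    return list(range(i * ratio, (i + 1) * ratio))


def super_block(i, j, l, ls):
    """Index of the single level-l block that contains block i of level j.

    Level l must be below level j (larger blocks).
    """
    return (i * ls[j]) // ls[l]


# Hypothetical sizes with the 1:2:4 ratio used in the Figure 2 example.
ls = {0: 16, 1: 32, 2: 64}
print(sub_blocks(0, 1, 0, ls))   # level-0 blocks inside block 0 of level 1
print(super_block(3, 0, 1, ls))  # level-1 block holding block 3 of level 0
```

With these sizes, block 0 of level 1 covers blocks 0 and 1 of level 0, and block 3 of level 0 falls inside block 1 of level 1 and block 0 of level 2.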

For the proposed inter-level coherence protocol, blocks can be in any of the
following four states:


-  Invalid (I): data in the block is entirely non-valid

-  Accessible (A): data in the block is valid and may be shared (read-only)
by one or more processors

-  Reserved (R): data in the block is the only valid copy in the hierarchy

-  Partially Invalid (P): at least one sub-block of the memory block is
outdated (due to a write in an upper-level memory)

Memory  access  operations  consist  of  read  and  write  commands  that  can  be 
issued  from  any  level  j  of  the  hierarchy.  These  operations  are  considered  atomic. 
The  read/write  commands  are  defined  in  terms  of  four  primitive  coherence  op¬ 
erations,  as  follows:  (the  algorithms  used  to  implement  these  basic  inter-level 
coherence  operations  are  defined  in  Appendix  A) 

-  ULI(Bj(i)) (Upper-Level Invalidate): Invalidates all sub-blocks of Bj(i)

-  ULW(Bj(i)) (Upper-Level Writeback): Writes back dirty data to Bj(i)
from upper-level sub-blocks; sets sub-blocks to Accessible (read-only)

-  LLP(Bj(i)) (Lower-Level Partial-Invalidate): Sets all superblocks of Bj(i)
to Partially Invalid

-  LLR(Bj(i)) (Lower-Level Read): Fetches block Bj(i) from lower-level
superblocks; sets superblocks to Accessible (read-only)

The  coherence  protocol  implements  read/write  operations  as  combinations  of 
these  four  primitives.  In  order  to  allow  for  multi-level  inclusion,  i.e.,  (definition 
here),  the  coherence  protocol  enforces  the  following  properties: 

1.  If a block Bj(i) is Partially Invalid, then all of its superblocks must also be
Partially Invalid

2.  If a block Bj(i) is Invalid, then all of its sub-blocks must also be Invalid

3.  If a block Bj(i) is Reserved, then all of its sub-blocks must be Invalid and
all of its superblocks must be Partially Invalid

4.  If a block Bj(i) is Accessible, then all of its sub-blocks are either Invalid or
Accessible, and all of its superblocks are either Partially Invalid or Accessible

Algorithms for the coherent write and read operations of a block Bj(i) in level j
are presented in Appendix A. Property 1 allows the coherence controller to fetch
the most recent copy of a block Bj(i) if any sub-block of it has been modified
by an upper-level processor before completing a read or write request. Property
3 ensures that a processor can complete a write to block Bj(i) when the state
of Bj(i) is Reserved without notifying other processors of the write, since it is
the only processor that has a valid copy of the block.
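The four properties can be read as invariants over the block states of the whole hierarchy. The sketch below checks them for a given configuration; it is an illustration only, not the coherence controller of Appendix A. The function name, the list-of-lists layout of `state`, and the `size` vector (block size per level, in units of the top-level block) are assumptions, and levels are numbered from 0 at the top as in Figure 2.

```python
I, A, R, P = "I", "A", "R", "P"  # Invalid, Accessible, Reserved, Partially Invalid


def violations(state, size):
    """Return (level, block index, property number) for every broken property.

    state[j][i] is the state of block i in level j (level 0 is the top level);
    size[j] is the block size of level j in units of the level-0 block size.
    """
    bad = []
    levels = range(len(state))
    for j in levels:
        for i, s in enumerate(state[j]):
            # States of all sub-blocks (upper levels) and superblocks (lower levels).
            subs = [state[k][m] for k in levels if k < j
                    for m in range(i * size[j] // size[k],
                                   (i + 1) * size[j] // size[k])]
            sups = [state[l][i * size[j] // size[l]] for l in levels if l > j]
            if s == P and any(t != P for t in sups):
                bad.append((j, i, 1))
            if s == I and any(t != I for t in subs):
                bad.append((j, i, 2))
            if s == R and (any(t != I for t in subs) or any(t != P for t in sups)):
                bad.append((j, i, 3))
            if s == A and (any(t not in (I, A) for t in subs)
                           or any(t not in (P, A) for t in sups)):
                bad.append((j, i, 4))
    return bad


# A 3-level hierarchy with block sizes in the ratio 1:2:4; level 1 holds a
# Reserved block and the bottom-level block is Partially Invalid.
print(violations([[I, I, I, I], [R, I], [P]], [1, 2, 4]))
```

The configuration in the usage line satisfies all four properties, so the returned list is empty; marking a top-level sub-block of the Reserved block as Accessible would be reported against properties 3 and 4.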

An  example  of  the  inter-level  coherence  protocol  operation  on  a  3-level  con¬ 
figuration  is  shown  in  Figure  2.  Each  memory  block  is  represented  in  this  figure 
by  both  its  state  (gray-shaded  boxes)  and  contents  of  each  of  its  sub-blocks. 
Note that the block sizes differ among the levels; a block in level 2 is twice as
large as a block in level 1 and four times as large as a block in level 0.

The  example  begins  with  the  configuration  of  Figure  2(a):  the  bottom  level 
has  valid  data  in  the  Reserved  state,  and  the  other  levels  have  invalid  data.  Level 
0 then issues a memory read of block B0(1). The protocol issues a lower-level
read primitive, bringing a sub-block of level 2 to level 1, containing values x
and y, then a sub-block of level 1 to level 0, containing the desired data (y). All
blocks  involved  in  this  transaction  become  Accessible  (Figure  2(b)). 

Suppose level 1 issues a write to block B1(0) and let t and u denote the
new contents of the respective sub-blocks. The protocol handles this request by
invalidating upper-level sub-blocks (B0(0) and B0(1)), setting the lower-level
superblock B2(0) to Partially Invalid and setting the state of B1(0) to Reserved
(Figure 2(c)).

The next memory reference in this example is a read of block B0(3)
(Figure 2(c)). Similarly to the first read, data is brought from level 2 to level 1, then
to level 0. The states of the blocks in levels 0 and 1 become Accessible. However,
the state of the block B2(0) in level 2 remains Partially Invalid to flag that at
least one of its sub-blocks (B1(0) in this case) contains data that needs to be
written back, as Figure 2(d) shows.

The last memory reference of this example is a read of B0(0). This reference
generates a write-back request to the Reserved sub-block B1(0) in the upper
level. The Reserved block in level 1 becomes Accessible (Figure 2(e)). This
assumes the existence of state bits associated with each sub-block of the adjacent
upper  level.  The  implementation  of  read  and  write  fences  for  relaxed  consis¬ 
tency  models  requires  that  the  coherence  protocol  handles  acknowledgments  of 
all  messages  exchanged  between  adjacent  levels  (not  shown  in  the  primitives  of 
Appendix  A).  The  completion  of  an  operation  (read  or  write)  issued  by  proces¬ 
sor  P  with  respect  to  all  other  processors  is  signalled  by  the  arrival  of  positive 
acknowledgment  from  upper  and  lower  adjacent  levels. 


Fig. 2. Example of coherence protocol I for 3-level configuration. Line states are shown
in the top of each box: P (Partially Invalid), I (Invalid), A (Accessible) and R (Reserved),
and contents of blocks are shown below the state.


3  Simulation  Models  and  Methodology

The locality studies presented in this paper are based on a generic four-level
HPAM architecture. Levels are labelled from 0 to 3, where 0 represents the
topmost level (smallest degree of parallelism). Level i in the hierarchy is
responsible for executing fractions of code which have degree of parallelism greater
than or equal to DoP_i and less than DoP_{i+1} (or infinity for the bottom level). The
degrees of parallelism in this study were fixed as powers of ten: DoP_0 = 1,
DoP_1 = 10, DoP_2 = 100 and DoP_3 = 1000. The variables used in the locality
experiments in this paper are defined as follows:

-  Line size of level i (ls_i): represented in log2 notation in this paper (i.e. ls_i = 8
means a line size of 2^8 = 256 bytes). This parameter determines the amount
of data that is transferred to level i when a data miss occurs in this level. It
is assumed that each level has an inter-level cache that is shared by all level
processors. The shared caches at each level are assumed to be ideal (fully
associative and infinite) in the simulations performed.

-  Data miss rate at level i (mr_i): percentage of memory references (loads and
stores) in level i that miss. Misses in level i can be serviced either by disk
(cold misses) or by another level j, i ≠ j. The latter case will be referred
to as inter-level misses throughout this paper. Intra-level misses, which are
serviced by processors in the same level, are not modeled.

-  Data traffic between levels i and i + 1 (tr_{i,i+1}): aggregate amount of data
transferred between levels i and i + 1. Inter-level communication is assumed
to occur between adjacent levels only. Therefore, tr_{i,i+1} accounts not only
for the amount of data exchanged between levels i and i + 1, but also for
any data transfer from/to level j, j < i, to/from level k, k > i + 1.
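As a concrete reading of the definitions above, the sketch below maps a degree of parallelism to the HPAM level responsible for it and converts a log2 line size to bytes. The thresholds are the ones fixed in this study; the function and constant names are hypothetical.

```python
import bisect

# DoP_0 .. DoP_3 as fixed in this study (powers of ten).
DOP_THRESHOLDS = [1, 10, 100, 1000]


def level_for_dop(dop, thresholds=DOP_THRESHOLDS):
    """Level i with thresholds[i] <= dop < thresholds[i + 1].

    The bottom level has no upper bound on the degree of parallelism.
    """
    if dop < thresholds[0]:
        raise ValueError("degree of parallelism must be at least 1")
    return bisect.bisect_right(thresholds, dop) - 1


def line_size_bytes(ls_log2):
    """Line size in bytes for a size given in log2 notation."""
    return 1 << ls_log2


print(level_for_dop(1), level_for_dop(10), level_for_dop(50000))
print(line_size_bytes(8))
```

Sequential code (degree of parallelism 1) lands in level 0, a loop with 10 parallel iterations in level 1, and anything with 1000 or more in level 3.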
The simulation methodology combines compiler-assisted parallelism identification
with execution-driven simulation of benchmarks. An application under
study  is  first  instrumented  with  the  Polaris  [12]  source-to-source  parallelizing 
compiler  to  detect  do-loop  parallelism.  The  instrumented  Fortran  code  has  tags 
inserted  in  the  beginning  and  end  of  each  loop  that  indicate  the  degree  of  par¬ 
allelism  of  the  loop.  The  Polaris-generated  code  is  then  compiled,  and  the  ex¬ 
ecutable  code  is  used  as  input  to  an  execution-driven  simulator.  The  simulator 
models a multi-level, shared-memory hierarchy, and is built on top of Shade [1].
The simulator engine traces each memory data access during program
execution. The engine identifies the level that issues each access by comparing the
current degree-of-parallelism tag (inserted in the instrumentation phase) with
the parallelism thresholds DoP_i. The memory access is forwarded to the
appropriate level cache handler, which characterizes the access either as a hit (data
has previously been in the level's cache) or a miss (either the data is present
in another level's cache or needs to be fetched from disk). Coherence messages
are sent by the cache handler to other level caches on misses, according to the
cache  coherence  protocol  under  use.  Hence,  the  inter-level  coherence  protocol 
behavior  is  modeled.  However,  the  intra-level  coherence  protocol  is  not  modeled. 
The  miss  rate  and  traffic  results  obtained  with  such  model  are  therefore  opti¬ 
mistic,  since  ideal  caches  and  intra-level  sharing  are  assumed.  Nonetheless,  this 
simplified  model  is  able  to  capture  the  inherent  spatial  locality  with  respect  to 
degree  of  parallelism  of  the  applications  under  study.  Miss  rates  degrade  when 
finite  caches  and  intra-level  sharing  are  considered,  but  locality  with  respect  to 
degree  of  parallelism  is  still  evident  for  non-ideal  memory  systems  [3] . 
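The hit/miss classification described above can be sketched as follows. This is a deliberately reduced model and not the simulator used in the study: it assumes ideal (infinite, fully associative) level caches and a single line size for all levels, and it ignores writes and invalidations, so it implements neither the migration nor the inclusion protocol. The function name and the trace format, a sequence of (issuing level, byte address) pairs, are hypothetical.

```python
def classify(trace, ls_log2, n_levels=4):
    """Count hits, cold misses and inter-level misses per level.

    trace is an iterable of (level, address) pairs, where level is the HPAM
    level that issues the access; ls_log2 is the line size in log2 notation.
    """
    seen = [set() for _ in range(n_levels)]  # lines held by each ideal cache
    counts = [{"hit": 0, "cold": 0, "inter": 0} for _ in range(n_levels)]
    for level, addr in trace:
        line = addr >> ls_log2
        if line in seen[level]:
            kind = "hit"
        elif any(line in lines for lines in seen):
            kind = "inter"  # serviced by another level's cache
        else:
            kind = "cold"   # serviced by disk
        counts[level][kind] += 1
        seen[level].add(line)
    return counts


trace = [(0, 0), (0, 8), (3, 0), (3, 64), (0, 64)]
print(classify(trace, ls_log2=4)[0])  # level 0 with 16-byte lines
print(classify(trace, ls_log2=7)[0])  # level 0 with 128-byte lines
```

Running the same trace with a larger line size turns some misses into hits, which is the effect the miss-rate measurements of the next section quantify per level.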

The following benchmarks from the CMU, Perfect Club and Spec95 suites
have been used in the spatial locality studies: Radar, Stereo, FFT2 and Airshed
(CMU Parallel Suite [9]); TRFD, FLO52, ARC2D, OCEAN and MDG (Perfect
Club Suite [7]); Hydro2d and Swim (Spec95 Suite).


4  Simulation  Results  and  Analysis 

For each benchmark, simulations have been performed for various line sizes (ls_i),
and measurements of miss rates (mr_i) and data traffic (tr_{i,i+1}) have been
collected. Two inter-level coherence protocols have been considered in this study.
Initially, a homogeneous solution analogous to a cache-only (COMA) protocol [4]
is used to observe the inherent locality behavior of the set of benchmarks under
study. In this scenario, a block can migrate to any level of the hierarchy, i.e.,
there is no fixed home node associated with a given memory block. Such a scenario
is referred to as "migration protocol" in the rest of this paper. The other
coherence solution considered is the heterogeneous protocol introduced in Section 2.1.
Such a scenario is referred to as "inclusion protocol" in the rest of this paper.

Subsection 4.1 presents data obtained when a migration protocol with unique
line sizes across the hierarchy is assumed. Such a scenario is used as a basis for
the analysis of (level) data locality. Subsection 4.2 compares the results obtained
from the migration scenario to results obtained when the coherence protocol
enforcing inclusion presented in Section 2.1 is used and line sizes are allowed to be
different across levels.


4.1  Migration  protocol  with  unique  line  size  across  levels

The  results  obtained  for  this  protocol  configuration  confirm  empirically  that 
applications  have  good  spatial  locality  with  respect  to  degree  of  parallelism,  in 
addition  to  previously  observed  [14]  temporal  locality  with  respect  to  degree  of 
parallelism. Inter-level miss rates are low for small line sizes; the highest miss rate,
17.3%, occurs in Swim for 16-byte line sizes, but typical values are around 1%.
Furthermore,  inter-level  miss  rates  tend  to  decrease  as  the  line  size  gets  larger. 

Figure 3 shows this trend for the benchmark FLO52, assuming a 4-level
HPAM. The figure is divided into four sub-plots, each corresponding to an
HPAM level, labelled lvl0 through lvl3 in the x-axis. Each sub-plot is further
divided into six line sizes, starting at ls = 2^4 bytes. For each level
and line size, the absolute number of misses is plotted in the y-axis in log scale,
and the corresponding miss rate is indicated in the x-axis between parentheses.
The different shades of the bars in the y-axis correspond to the percentage of
inter-level misses that are either cold misses or serviced by another level. To
illustrate this notation, consider the case where FLO52 runs on a 4-level HPAM
with line size of 2^6 = 64 bytes in all levels. The inter-level miss rates for levels 0
through 3 are 3.40%, 0.35%, 0.20% and 0.63%. For level lvl2 and line size of 64
bytes, approximately half of the inter-level misses are serviced by level 3, 20% are
serviced by level 1, 30% by level 0, and a negligible fraction is due to cold misses.


Fig. 3. Misses: FLO52, 4 levels, protocol C (x-axis: level, lg(line size), (% miss rate))


Figure 3 shows that the spatial locality behavior for FLO52 varies across
hierarchy levels. For level 0, the inter-level miss rate remains approximately
constant, slightly degrading as the line size increases. In contrast, in level 3 the
miss rate decreases by about two orders of magnitude as the line size increases from
2^4 bytes to the largest size considered. Such behavior suggests that the parallel
fraction of FLO52 that executes in the lowest hierarchy level operates on large,
regular data structures that benefit from fetching large data lines on a miss.

While a larger line size tends to improve inter-level miss rates, it also tends
to increase inter-level data traffic. Figure 4 shows how data traffic between
adjacent levels varies with line size for FLO52. Notice that
the traffic between levels 0 and 1 in this case increases by about three orders
of magnitude for the range of line sizes considered, while the traffic between
levels 2 and 3 increases only by about two orders of magnitude across the same
line size range. Such behavior can be explained with the aid of the inter-level
miss rate profile for FLO52 (Figure 3). The larger lines brought to levels
2 and 3 often contain data that is likely to be used in future references, while
larger lines in levels 0 and 1 most often bring data that remains unused.
Since traffic is proportional to the product of number of misses and line size,
if the number of misses does not decrease as line size increases, the traffic increases.
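The relation between misses, line size and traffic can be made concrete with a short calculation. The miss counts below are hypothetical and chosen only to contrast the two behaviors; they are not measurements from the study.

```python
def traffic_bytes(misses, ls_log2):
    """Traffic generated by a number of misses at a given log2 line size."""
    return misses * (1 << ls_log2)


# Case 1: misses do not drop as the line grows from 2^4 to 2^12 bytes,
# so traffic grows by the full factor 2^12 / 2^4.
growth_flat = traffic_bytes(1000, 12) / traffic_bytes(1000, 4)

# Case 2: misses drop by two orders of magnitude over the same range,
# which offsets most of the increase in line size.
growth_drop = traffic_bytes(10, 12) / traffic_bytes(1000, 4)

print(growth_flat, growth_drop)
```

In the first case traffic grows by a factor of 256; in the second, where larger lines remove most misses, it grows by a factor of only 2.56.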


Fig. 4. Traffic: FLO52, 4 levels, protocol C (x-axis: levels, lg(line size))


Table 1 summarizes the maximum and minimum miss rates found for each
benchmark studied, as line size varies over the range considered (starting at
2^4 bytes) in an HPAM organization with four levels. The benchmarks Stereo and
Swim have only three
distinct  levels  of  parallelism  detected  by  Polaris,  and  FFT2  has  only  two.  For 
these  benchmarks,  the  HPAM  levels  that  do  not  generate  memory  references 
have  null  minimum  and  maximum  miss  rate  entries  in  Table  1. 

The  results  summarized  in  Table  1  show  good  spatial  locality  with  respect 
to  degree  of  parallelism  for  the  benchmarks  studied.  The  maximum  inter-level 
miss  rate  observed  for  the  lowest  level  of  the  hierarchy  is  0.9%;  for  the  topmost 
level, the maximum inter-level miss rate observed is 17.3% for Swim, and less
than or equal to 4.01% for all other benchmarks. In general, the best miss rates
are found in the lowest level of the architecture, where loops with high degree
of parallelism are executed.

Table 1. Min/Max inter-level miss rates for a 4-level HPAM configuration

             Level 0               Level 1              Level 2              Level 3
Benchmark    min        max        min       max        min       max        min       max
FLO52        2.66%      3.40%      0.15%     0.84%      0.082%    0.48%      0.0082%   0.20%
TRFD         0.12%      1.30%      5.28%     6.97%      0.87%     1.08%      0.046%    0.086%
MDG          0.39%      1.50%      0.77%     3.16%      0.049%    3.76%      0.0012%   0.097%
ARC2D        1.03%      2.17%      0.0043%   1.45%      0.0071%   1.41%      0.0004%   0.062%
Stereo       0.021%     1.48%      -         -          0.049%    1.51%      0.011%    0.12%
Radar        0.00039%   0.066%     0.27%     1.63%      0.035%    3.79%      0.0015%   0.90%
FFT2         0.00008%   0.065%     -         -          -         -          0.0050%   0.65%
Swim         12.81%     17.30%     0.80%     9.48%      0.0010%   0.011%     -         -

Entries for the remaining benchmarks, in source order (rows incomplete, column
positions uncertain):

OCEAN        0.096%, 1.63%, 0.27%, 1.70%, 0.023%, 0.45%
Airshed      0.72%, 0.049%, 0.049%, 0.018%, 0.037%
Hydro2d      0.00038%, 0.0074%


4.2  Inclusion  protocol 

Identical  line  sizes:  The  inclusion  protocol  was  initially  studied  assuming  that 
a  unique  line  size  is  used  across  all  HPAM  levels,  similar  to  the  migration  protocol 
discussed  in  Subsection  4.1.  Since  the  level  caches  are  assumed  to  be  ideal,
and  intra-level  sharing  is  not  modelled,  the  results  for  the  inclusion  coherence 
protocol  with  unique  line  size  do  not  differ  considerably  from  the  results  obtained 
from  the  migration  protocol  simulations,  in  general.  Assuming  such  idealized 
memory  model,  both  protocols  yield  measurements  that  characterize  the  inherent 
sharing  behavior  of  the  benchmarks.  The  inter-level  miss  rates  obtained  for  this 
scenario  differ  by  at  most  17%  from  the  migration  scenario  for  TRFD,  with  an
average  difference  of  1.4%  across  all  benchmarks. 


Distinct  line  sizes:  Conventional  uniprocessor  cache  hierarchies  typically  use 
distinct  line  sizes  across  the  cache  levels;  large  lines  are  desirable  in  large  caches 
to  improve  miss  rates,  while  small  cache  lines  are  desirable  in  small  caches  to 
avoid  excessive  bandwidth  requirements  and  increases  in  miss  penalties  and  con¬ 
flict  misses  [5,13].  An  HPAM  machine  can  also  benefit  from  distinct  line  sizes 
across  levels  by  reducing  inter-level  traffic  while  not  sacrificing  inter-level  miss 
rates. 

The inter-level miss rate and traffic profiles for the benchmark FLO52
(Figures 3 and 4) illustrate a scenario commonly observed in the simulations
performed, where a large line size effectively reduces the inter-level miss rate in
level 3, but unnecessarily increases the traffic between levels 0 and 1 without
improving the inter-level miss rates in these levels. The inter-level inclusion
coherence protocol described in Section 2.1 supports a configuration with distinct
line sizes across HPAM levels; the effects of distinct line sizes on inter-level miss
rates and traffic have been captured quantitatively through simulation and are
discussed in this subsection.

Line sizes have been set up such that the relationship ls_{i+1} > ls_i holds for any
adjacent levels i, i + 1. Hence, levels executing highly parallel code are assigned
line sizes strictly larger than levels executing moderately parallel or sequential
code. Table 2 shows how inter-level miss rates and traffic for a configuration
with multiple line sizes compare to configurations with unique line sizes, for the
benchmark MDG. The first row of Table 2 shows inter-level miss rates for the
multiple line size configuration. Rows 2 through 5 of the table show miss rates
obtained in four different simulations with unique line sizes, each corresponding
to a line size chosen for the multiple line size scenario. The remaining rows of
Table 2 show the total traffic in the level boundaries for three scenarios: multiple
line sizes, unique line of smallest size (2^6 bytes), and unique line of largest size
(2^12 bytes).


Table 2. Traffic and inter-level miss rates for MDG with multiple line sizes: 2^6, 2^8,
2^10 and 2^12

                             Level 0       Level 1     Level 2      Level 3
                             (ls=2^6)      (ls=2^8)    (ls=2^10)    (ls=2^12)
Miss rate (multiple)         1.28%         1.38%       0.10%        0.0012%
Miss rate (unique, 2^6)      [illegible]   2.17%       1.00%        0.025%
Miss rate (unique, 2^8)      [illegible]   1.44%       0.29%        0.0073%
Miss rate (unique, 2^10)     0.56%         0.98%       0.10%        0.0026%
Miss rate (unique, 2^12)     0.39%         0.77%       0.049%       0.0012%

                             levels 0-1    levels 1-2  levels 2-3
Traffic (multiple)           26.6GB        12.9MB      17.7MB
Traffic (single, ls=2^6)     15.6GB        10.8MB      11.3MB
Traffic (single, ls=2^12)    1.6TB         71.8MB      127.7MB

Table  2  shows  that  the  miss  rate  observed  in  the  multiple  line  size  scenario 
is  at  most  .56%  larger  than  the  rate  observed  for  the  corresponding  unique  line 
size  rate  (values  in  bold  face)  for  each  level.  For  levels  2  and  3,  in  particular,  the 
inter-level  miss  rates  are  equal  for  both  scenarios.  When  inter-level  traffics  are 
compared,  the  multiple  line  size  scenario  demands  about  an  order  of  magnitude 
less  traffic  than  the  unique  line  size  scenario  with  the  largest  line  size,  while 
demanding no more than 70% more traffic than the unique line size scenario
with the smallest line size. In this example, the multiple line size configuration
is therefore capable of providing very low miss rates at the lowest hierarchy
level without generating excessive traffic in the upper level boundaries. When
a  unique  line  size  is  used,  either  the  miss  rate  in  the  lowest  level  or  traffic  in 
the  topmost  level  degrades.  The  same  motivations  for  using  multiple  line  sizes 
across  uniprocessor  memory  hierarchies  thus  apply  to  a  hierarchy  of  processor- 
and-memories:  maintaining  good  miss  rates  across  the  hierarchy  while  avoiding 
the  generation  of  unnecessary  traffic  in  the  upper  levels. 
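The traffic comparison can be reproduced from the figures of Table 2. The sketch below is only an arithmetic check; it assumes decimal units (1 TB = 1000 GB), which the source does not state, and the dictionary and function names are hypothetical.

```python
GB, MB, TB = 1e9, 1e6, 1e12  # decimal units are an assumption

# Traffic at the level boundaries 0-1, 1-2 and 2-3, as listed in Table 2.
traffic = {
    "multiple": [26.6 * GB, 12.9 * MB, 17.7 * MB],
    "smallest": [15.6 * GB, 10.8 * MB, 11.3 * MB],  # unique line size 2^6
    "largest": [1.6 * TB, 71.8 * MB, 127.7 * MB],   # unique line size 2^12
}


def ratios(a, b):
    """Per-boundary traffic of scenario a divided by that of scenario b."""
    return [x / y for x, y in zip(traffic[a], traffic[b])]


print([round(r, 2) for r in ratios("multiple", "smallest")])
print([round(r, 1) for r in ratios("largest", "multiple")])
```

Against the smallest unique line size, the multiple line size scenario moves between roughly 1.2 and 1.7 times as much data; against the largest unique line size it moves roughly 5.6 to 60 times less, depending on the boundary.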
Table 3 shows the average increase in the inter-level miss rate of the multiple
line size configuration compared to the miss rate of a corresponding unique line
size configuration for simulations performed in six of the studied benchmarks.
The inter-level miss rates of the lowest levels in the hierarchy remain unchanged,
with respect to the unique line size scenario, when multiple line sizes are used.
Miss rates at the topmost level increase by 31% on average.


Table 3. Average ratio: miss_rate(multi)/miss_rate(single)

             Level 0    Level 1    Level 2    Level 3
Average      1.31       1.05       1.01       1.00


5  Conclusions 

The conclusions reached in this paper provide guidelines to the design of the
memory and network subsystems of an HPAM machine. The implementation of
a coherence controller that supports multiple line sizes across the hierarchy is
an ongoing research subject. The inclusion coherence protocol presented in this
paper has been used as a proof of concept to study the advantages of fetching
larger blocks of data to lower levels of the hierarchy as a means of increasing
spatial locality without sacrificing traffic in the upper levels of the hierarchy. One
solution under investigation that may require minimal modifications to the
existing directory controllers relies on hardware-assisted prefetching. In this scheme,
the coherence unit size is kept constant across the hierarchy. However, lower
hierarchy levels prefetch a larger number of coherence units on a miss than upper
levels. Such a scheme allows reuse of cache coherence implementations found in
homogeneous machines.

The experimental results obtained in this study for inter-level miss rates among
different parallelism levels confirm that there is spatial locality with respect to
degree of parallelism in parallel applications, in addition to previously observed
temporal locality. The differences in degrees of parallelism and memory capacity
across HPAM levels motivate the use of multiple line sizes across the hierarchy
as a means of reducing inter-level traffic associated with large line sizes
while keeping miss rates comparable to the case where a unique line size is used
across all levels. An invalidation-based inter-level coherence protocol that
supports such a multiple line size configuration across processor-and-memory levels
has been proposed, and the experimental results obtained with simulations using
this protocol have confirmed that more balanced inter-level miss rate and traffic
characteristics can be achieved with line sizes that increase from the top to the
bottom of the hierarchy. A distribution with larger data blocks at the lowest
levels of the hierarchy is consistent with the proposed HPAM organization, where
the lowest levels have larger amounts of memory.
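
The trade-off between miss rate and traffic as the line size grows can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of the experiments: the cache is idealized (unlimited capacity, cold misses only), the two traces are synthetic, and the name `miss_rate_and_traffic` is hypothetical.

```python
def miss_rate_and_traffic(addresses, line_size):
    """Return (miss rate, bytes transferred) for an idealized cache with
    unlimited capacity, so that only cold misses occur."""
    addresses = list(addresses)
    resident = set()     # line numbers already fetched
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in resident:
            resident.add(line)
            misses += 1
    return misses / len(addresses), misses * line_size


dense = range(0, 512, 8)            # 64 accesses with high spatial locality
sparse = range(0, 64 * 256, 256)    # 64 accesses with no spatial locality

for name, trace in (("dense", dense), ("sparse", sparse)):
    for line_size in (32, 256):
        rate, traffic = miss_rate_and_traffic(trace, line_size)
        print(f"{name:6s} line={line_size:3d}  miss rate={rate:.3f}  traffic={traffic} bytes")
```

For the dense trace the larger line lowers the miss rate at equal traffic, whereas for the sparse trace it leaves the miss rate unchanged and multiplies the traffic. Choosing the line size per level lets each level sit on the favorable side of this trade-off.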

Idealized caches and line sizes ranging from very small to very large have
been used in the experiments in order to observe the inherent locality behavior
of the studied benchmarks. The authors believe that the overall inter-level
locality behavior in systems with non-ideal caches can be derived from the results
obtained.

An HPAM machine combines heterogeneity, data locality with respect to degree
of parallelism, and computing near memory to provide a cost-effective solution
to high-performance parallel computing. The data locality studies presented
in this paper confirm that HPAM machines have the potential to competitively
exploit the trend towards merging processor and memory technologies and the
increasingly powerful but also extremely expensive fabrication processes
needed for billion-transistor chips.


Appendix  A:  Inter-Level  Coherence  Protocol  Messages 


ULW(Bj(i))  // UPPER-LVL WRITEBACK
  for all level-(j-1) sub-blocks
    if (sub-block is PART-INV) then
      ULW(sub-block)
    if (sub-block is not INV) then
      write-back sub-block from level j-1 to level j
      state(sub-block) = ACC
  state(Bj(i)) = ACC

LLR(Bj(i))  // LOWER-LEVELS READ
  temp = Bj(i)
  L = j
  while (temp is INV)
    increment L
    temp = level-L superblock of temp
  decrement L
  while (L >= j)
    read level-L superblock from level L+1
    state(level-L superblock) = ACC
    decrement L

ULI(Bj(i))  // UPPER-LVL INVALIDATE
  if (j is not the first level)
    for all level-(j-1) sub-blocks
      if (sub-block is not INV)
        state(sub-block) = INV
        ULI(sub-block)

LLP(Bj(i))  // LOWER-LEVELS PARTIAL-INVALIDATE
  temp = superblock of Bj(i)
  L = j+2
  while (temp is not PART-INV)
    state(temp) = PART-INV
    increment L
    temp = level-L superblock of temp

READ(Bj(i))  // MEMORY READ
  if (Bj(i) is INV)
    LLR(Bj(i))
    state(Bj(i)) = ACC
  if (Bj(i) is PART-INV) then
    ULW(Bj(i))
  read Bj(i) from level j

WRITE(Bj(i))  // MEMORY WRITE
  if (Bj(i) is RES or ACC)
    ULI(Bj(i)); LLP(Bj(i));
  if (Bj(i) is PART-INV)
    ULW(Bj(i)); ULI(Bj(i));
  if (Bj(i) is INV)
    LLR(Bj(i)); LLP(Bj(i));
  state(Bj(i)) = RES
  write Bj(i) to level j
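
The messages in this appendix can be exercised with a small Python sketch that tracks block states only. It is one possible reading of the pseudocode, not the authors' implementation, and it rests on several assumptions: data movement is not modeled; the hierarchy is a tree in which every block has the same number of sub-blocks; initially only the bottom-level block is valid (ACC); superblocks are reached through parent pointers rather than the level counter L; and LLP stops at the bottom of the hierarchy. The names `Block` and `build` are hypothetical.

```python
INV, PART_INV, ACC, RES = "INV", "PART-INV", "ACC", "RES"


class Block:
    """One block Bj(i): `level` is j, `parent` is its level-(j+1) superblock,
    `subs` are its level-(j-1) sub-blocks. Only states are tracked."""

    def __init__(self, level, parent=None, state=INV):
        self.level, self.parent, self.state = level, parent, state
        self.subs = []


def build(levels, fanout):
    """Assumed initial condition: only the bottom block holds valid data."""
    bottom = Block(levels - 1, state=ACC)
    frontier = [bottom]
    for level in range(levels - 2, -1, -1):
        nxt = []
        for blk in frontier:
            blk.subs = [Block(level, parent=blk) for _ in range(fanout)]
            nxt.extend(blk.subs)
        frontier = nxt
    return bottom


def ulw(b):  # UPPER-LVL WRITEBACK
    for sub in b.subs:
        if sub.state == PART_INV:
            ulw(sub)
        if sub.state != INV:
            sub.state = ACC       # stands in for the write-back to level j
    b.state = ACC


def uli(b):  # UPPER-LVL INVALIDATE
    for sub in b.subs:            # level-0 blocks have no sub-blocks
        if sub.state != INV:
            sub.state = INV
            uli(sub)


def llr(b):  # LOWER-LEVELS READ
    chain, temp = [], b
    while temp.state == INV:      # climb to the nearest valid superblock
        chain.append(temp)
        temp = temp.parent
    for blk in reversed(chain):   # fetch back down to level j
        blk.state = ACC


def llp(b):  # LOWER-LEVELS PARTIAL-INVALIDATE
    temp = b.parent
    while temp is not None and temp.state != PART_INV:
        temp.state = PART_INV
        temp = temp.parent


def read(b):  # MEMORY READ
    if b.state == INV:
        llr(b)
        b.state = ACC
    elif b.state == PART_INV:
        ulw(b)


def write(b):  # MEMORY WRITE
    if b.state in (RES, ACC):
        uli(b)
        llp(b)
    elif b.state == PART_INV:
        ulw(b)
        uli(b)
    elif b.state == INV:
        llr(b)
        llp(b)
    b.state = RES


bottom = build(levels=3, fanout=2)
mid0 = bottom.subs[0]
top0 = mid0.subs[0]
read(top0)     # fetches the block up through the hierarchy
write(top0)    # marks the superblocks below as partially invalid
print(top0.state, mid0.state, bottom.state)
```

In this sketch a write at the top level leaves the written block in RES and its superblocks in PART-INV, and a later read of a superblock triggers the upper-level writeback that returns it to ACC.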